MyArxiv
Computer Vision and Pattern Recognition 157
☆ Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
☆ DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.
LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution NeurIPS 2025
Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
comment: Accept by NeurIPS 2025
☆ See the Text: From Tokenization to Visual Reading
People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
☆ FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning
Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
comment: Accepted at MM 2025
☆ Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework NeurIPS 2025
Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
comment: Accepted by NeurIPS 2025 (Poster)
☆ SAM 2++: Tracking Anything at Any Granularity
Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
comment: 8 pages, and 10 pages in Supplementary Material
☆ An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection
Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. Early detection is vital for treatment, yet the lack of skilled radiologists underscores the need for artificial intelligence (AI)-driven screening tools. Developing reliable AI models is challenging due to the necessity for large, high-quality datasets, which are costly to obtain. To tackle this, we propose a teacher--student framework which enhances both disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection, significantly outperforming baselines. The explainability assessments also show the model bases its predictions on relevant anatomical features, demonstrating promise for deployment in clinical screening and triage settings.
comment: 16 pages, 3 figures
☆ A Geometric Approach to Steerable Convolutions
In contrast to the somewhat abstract, group theoretical approach adopted by many papers, our work provides a new and more intuitive derivation of steerable convolutional neural networks in $d$ dimensions. This derivation is based on geometric arguments and fundamental principles of pattern matching. We offer an intuitive explanation for the appearance of the Clebsch--Gordan decomposition and spherical harmonic basis functions. Furthermore, we suggest a novel way to construct steerable convolution layers using interpolation kernels that improve upon existing implementation, and offer greater robustness to noisy data.
☆ ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP
comment: 17 pages, 5 fiugres
☆ Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed -- without per-scene retraining or parameter tuning -- has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel "Rebellious Student" framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network -- the rebellious student -- is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Extensive experiments on the HAD100 benchmark show substantial improvements over several established baselines with minimal computational overhead, confirming the effectiveness and generality of the proposed complementary learning paradigm. Our code is publicly available at https://github.com/xjpp2016/FERS.
☆ UltraGen: High-Resolution Video Generation with Hierarchical Attention
Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (<=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
☆ Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction NeurIPS 2025
As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative by demonstrating strong generalization and requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model's accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated inpainting that highlights its role for mitigation support. The results indicate that foundation models offer a powerful way for evaluating urban heat island mitigation strategies in data-scarce regions to support more climate-resilient cities.
comment: 10 pages, 9 figures. Accepted at the NeurIPS 2025 Workshop on Tackling Climate Change with Machine Learning
☆ Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation
Climate change is intensifying the occurrence of harmful algal bloom (HAB), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes vision language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.
☆ SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery NeurIPS 2025
This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: https://visual-ai.github.io/seal/
comment: Accepted to NeurIPS 2025
☆ Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for real-time view synthesis in colonoscopy, enabling critical applications such as virtual colonoscopy and lesion tracking. However, the vanilla 3DGS assumes static illumination and that observed appearance depends solely on viewing angle, which causes incompatibility with the photometric variations in colonoscopic scenes induced by dynamic light source/camera. This mismatch forces most 3DGS methods to introduce structure-violating vaporous Gaussian blobs between the camera and tissues to compensate for illumination attenuation, ultimately degrading the quality of 3D reconstructions. Previous works only consider the illumination attenuation caused by light distance, ignoring the physical characters of light source and camera. In this paper, we propose ColIAGS, an improved 3DGS framework tailored for colonoscopy. To mimic realistic appearance under varying illumination, we introduce an Improved Appearance Modeling with two types of illumination attenuation factors, which enables Gaussians to adapt to photometric variations while preserving geometry accuracy. To ensure the geometry approximation condition of appearance modeling, we propose an Improved Geometry Modeling using high-dimensional view embedding to enhance Gaussian geometry attribute prediction. Furthermore, another cosine embedding input is leveraged to generate illumination attenuation solutions in an implicit manner. Comprehensive experimental results on standard benchmarks demonstrate that our proposed ColIAGS achieves the dual capabilities of novel view synthesis and accurate geometric reconstruction. It notably outperforms other state-of-the-art methods by achieving superior rendering fidelity while significantly reducing Depth MSE. Code will be available.
☆ IF-VidCap: Can Video Caption Models Follow Instructions?
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
comment: https://github.com/NJU-LINK/IF-VidCap
☆ SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it still remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive experiments demonstrate that the proposed method achieves a 5$\times$ reduction in memory usage and a notable 6.6$\times$ speedup in overall throughput with only minimal visual quality loss, thereby enabling highly efficient native autoregressive image generation on resource-constrained hardware.
☆ PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting NeurIPS 2025
This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. The project page is available at https://lck666666.github.io/plana3r
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). The project page is available at: https://lck666666.github.io/plana3r
☆ A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Recently, action recognition has been dominated by transformer-based methods, thanks to their spatiotemporal contextual aggregation capacities. However, despite the significant progress achieved on scene-related datasets, they do not perform well on motion-sensitive datasets due to the lack of elaborate motion modeling designs. Meanwhile, we observe that the widely-used cost volume in traditional action recognition is highly similar to the affinity matrix defined in self-attention, but equipped with powerful motion modeling capacities. In light of this, we propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way, with the proposal of the Explicit Motion Information Mining module (EMIM). In EMIM, we propose to construct the desirable affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from the query-based neighboring area in the next frame in a sliding-window manner. Then, the constructed affinity matrix is used to aggregate contextual information for appearance modeling and is converted into motion features for motion modeling as well. We validate the motion modeling capacities of our method on four widely-used datasets, and our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets, i.e., Something-Something V1 & V2.
comment: accepted by Pattern Recognition. We have been always curious to see whether our designs could be beneficial in other scenarios, such as embedding it into the DiT model or 3D-VAE for video generation. If you are interested in it, why not give it a shot?
☆ Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
comment: Project page: this https://linyq17.github.io/VC2L/
☆ UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.
comment: Project page: codegoat24.github.io/UniGenBench/
☆ MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
comment: 15 pages, 12 figures
☆ Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification
This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.
comment: Published in The 12th IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2025
☆ Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches
The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
comment: Submitted to the IEEE Journal of Biomedical And Health Informatics
☆ Image augmentation with invertible networks in interactive satellite image change detection
This paper devises a novel interactive satellite image change detection algorithm based on active learning. Our framework employs an iterative process that leverages a question-and-answer model. This model queries the oracle (user) about the labels of a small subset of images (dubbed as display), and based on the oracle's responses, change detection model is dynamically updated. The main contribution of our framework resides in a novel invertible network that allows augmenting displays, by mapping them from highly nonlinear input spaces to latent ones, where augmentation transformations become linear and more tractable. The resulting augmented data are afterwards mapped back to the input space, and used to retrain more effective change detection criteria in the subsequent iterations of active learning. Experimental results demonstrate superior performance of our proposed method compared to the related work.
☆ Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression NeurIPS 2025
This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2\% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
comment: Accepted to NeurIPS 2025
☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
☆ C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression BMVC2025
Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that prompts sparsity in model structures, e.g. weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolution neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
comment: 10 pages, BMVC2025
☆ Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploits the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.
comment: 12 pages, 4 figures
☆ CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
comment: 24 pages, 6 figures
☆ CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder NeurIPS 2025
Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and update only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
comment: NeurIPS 2025
☆ Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
We present Kaleido, a subject-to-video~(S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
comment: 11 pages, 6 figures
☆ Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving
Robust perception in automated driving requires reliable performance under adverse conditions, where sensors may be affected by partial failures or environmental occlusions. Although existing autonomous driving datasets inherently contain sensor noise and environmental variability, very few enable controlled, parameterised, and reproducible degradations across multiple sensing modalities. This gap limits the ability to systematically evaluate how perception and fusion architectures perform under well-defined adverse conditions. To address this limitation, we introduce the Occluded nuScenes Dataset, a novel extension of the widely used nuScenes benchmark. For the camera modality, we release both the full and mini versions with four types of occlusions, two adapted from public implementations and two newly designed. For radar and LiDAR, we provide parameterised occlusion scripts that implement three types of degradations each, enabling flexible and repeatable generation of occluded data. This resource supports consistent, reproducible evaluation of perception models under partial sensor failures and environmental interference. By releasing the first multi-sensor occlusion dataset with controlled and reproducible degradations, we aim to advance research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
☆ GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization
This technical report outlines the top-ranking solution for RoboSense 2025: Track 3, achieving state-of-the-art performance on 3D object detection under various sensor placements. Our submission utilizes GBlobs, a local point cloud feature descriptor specifically designed to enhance model generalization across diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer from a \enquote{geometric shortcut} when trained on conventional global features (\ie, absolute Cartesian coordinates). This introduces a position bias that causes models to primarily rely on absolute object position rather than distinguishing shape and appearance characteristics. Although effective for in-domain data, this shortcut severely limits generalization when encountering different point distributions, such as those resulting from varying sensor placements. By using GBlobs as network input features, we effectively circumvent this geometric shortcut, compelling the network to learn robust, object-centric representations. This approach significantly enhances the model's ability to generalize, resulting in the exceptional performance demonstrated in this challenge.
comment: 1st place at the IROS'25 RoboSense Challenge, Track #3: Cross-Sensor Placement 3D Object Detection
☆ RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
☆ DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
The rise of convenience packaging has led to generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection (YOLOv8n, YOLOv11n) using a subset of our own waste data set and annotated it using the custom tool Annotated Lab. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~ 96%) but suffered from high latency (~ 0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 77% mAP) with ultra-fast inference (~ 0.03s) and significantly smaller model sizes (< 7MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of "Greener AI" models to support real-time, sustainable waste sorting on edge devices.
comment: 8 pages, 8 figures
☆ Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
comment: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
☆ Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.
comment: Project page is available at https://liujf1226.github.io/Mono4DGS-HDR/
☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: Code and models available at: https://github.com/tianciB/VFM-VAE
☆ LAND: Lung and Nodule Diffusion for 3D Chest CT Synthesis with Anatomical Guidance
This work introduces a new latent diffusion model to generate high-quality 3D chest CT scans conditioned on 3D anatomical masks. The method synthesizes volumetric images of size 256x256x256 at 1 mm isotropic resolution using a single mid-range GPU, significantly lowering the computational cost compared to existing approaches. The conditioning masks delineate lung and nodule regions, enabling precise control over the output anatomical features. Experimental results demonstrate that conditioning solely on nodule masks leads to anatomically incorrect outputs, highlighting the importance of incorporating global lung structure for accurate conditional synthesis. The proposed approach supports the generation of diverse CT volumes with and without lung nodules of varying attributes, providing a valuable tool for training AI models or healthcare professionals.
☆ Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection ICCV 2025
At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations exerts a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at https://github.com/xiaohainku/RISE.
comment: ICCV 2025
☆ ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
☆ ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameters efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2$\times$ depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by the validation in object detection task.
☆ Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling
This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
comment: 6 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering 2025
☆ Bayesian Fully-Connected Tensor Network for Hyperspectral-Multispectral Image Fusion
Tensor decomposition is a powerful tool for data analysis and has been extensively employed in the field of hyperspectral-multispectral image fusion (HMF). Existing tensor decomposition-based fusion methods typically rely on disruptive data vectorization/reshaping or impose rigid constraints on the arrangement of factor tensors, hindering the preservation of spatial-spectral structures and the modeling of cross-dimensional correlations. Although recent advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have partially alleviated these limitations, the process of reorganizing data into higher-order tensors still disrupts the intrinsic spatial-spectral structure. Furthermore, these methods necessitate extensive manual parameter tuning and exhibit limited robustness against noise and spatial degradation. To alleviate these issues, we propose the Bayesian FCTN (BFCTN) method. Within this probabilistic framework, a hierarchical sparse prior that characterizing the sparsity of physical elements, establishes connections between the factor tensors. This framework explicitly models the intrinsic physical coupling among spatial structures, spectral signatures, and local scene homogeneity. For model learning, we develop a parameter estimation method based on Variational Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which significantly reduces the need for manual parameter tuning. Extensive experiments demonstrate that BFCTN not only achieves state-of-the-art fusion accuracy and strong robustness but also exhibits practical applicability in complex real-world scenarios.
☆ Entropy-Enhanced Conformal Features from Ricci Flow for Robust Alzheimer's Disease Classification
Background and Objective: In brain imaging, geometric surface models are essential for analyzing the 3D shapes of anatomical structures. Alzheimer's disease (AD) is associated with significant cortical atrophy, making such shape analysis a valuable diagnostic tool. The objective of this study is to introduce and validate a novel local surface representation method for the automated and accurate diagnosis of AD. Methods: The study utilizes T1-weighted MRI scans from 160 participants (80 AD patients and 80 healthy controls) from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Cortical surface models were reconstructed from the MRI data using Freesurfer. Key geometric attributes were computed from the 3D meshes. Area distortion and conformal factor were derived using Ricci flow for conformal parameterization, while Gaussian curvature was calculated directly from the mesh geometry. Shannon entropy was applied to these three features to create compact and informative feature vectors. The feature vectors were used to train and evaluate a suite of classifiers (e.g. XGBoost, MLP, Logistic Regression, etc.). Results: Statistical significance of performance differences between classifiers was evaluated using paired Welch's t-test. The method proved highly effective in distinguishing AD patients from healthy controls. The Multi-Layer Perceptron (MLP) and Logistic Regression classifiers outperformed all others, achieving an accuracy and F$_1$ Score of 98.62%. Conclusions: This study confirms that the entropy of conformally-derived geometric features provides a powerful and robust metric for cortical morphometry. The high classification accuracy underscores the method's potential to enhance the study and diagnosis of Alzheimer's disease, offering a straightforward yet powerful tool for clinical research applications.
☆ S2AP: Score-space Sharpness Minimization for Adversarial Pruning
Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
☆ Cross-Modal Scene Semantic Alignment for Image Complexity Assessment
Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.
comment: 14 pages,2 figures, British Machine Vision Conference
☆ FeatureFool: Zero-Query Fooling of Video Models via Feature Map
The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70\% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
☆ Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state of the art methods, even without requiring additional training.
☆ Learning Human-Object Interaction as Groups
Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
☆ Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.
☆ AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
comment: 13 pages, 9 figures
☆ GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribution classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
comment: This work was initially drafted in November 2022
☆ ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation
Realistic view extrapolation is critical for closed-loop simulation in autonomous driving, yet it remains a significant challenge for current Novel View Synthesis (NVS) methods, which often produce distorted and inconsistent images beyond the original trajectory. This report presents our winning solution which ctook first place in the RealADSim Workshop NVS track at ICCV 2025. To address the core challenges of street view extrapolation, we introduce a comprehensive four-stage pipeline. First, we employ a data-driven initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding local minima. Second, we inject strong geometric priors by modeling the road surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a generative prior to create pseudo ground truth for extrapolated viewpoints, providing auxilary supervision. Finally, a data-driven adaptation network removes time-specific artifacts. On the RealADSim-NVS benchmark, our method achieves a final score of 0.441, ranking first among all participants.
☆ Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net
The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
comment: Submitted to a SN journal
☆ Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at https://github.com/jinlin2021/ATED.
☆ OmniNWM: Omniscient Driving Navigation World Models
Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
comment: https://arlo0o.github.io/OmniNWM/
☆ The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
comment: Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025
☆ Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model's proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at https://github.com/xmed-lab/Med-RwR.
comment: Work in progress
☆ GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation ICCV
We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
comment: Accepted to ICCV Findings 2025. The first two authors contributed equally. The last two authors share co-corresponding authorship
Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models
Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression etc., while preserving the identity of the face. Recent progress in 2D generative models have enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions. Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: https://vishal-vinod.github.io/gmpi-edit/
comment: 14 pages, 7 figures
☆ StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
☆ TreeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation
In medical image segmentation tasks, Domain Generalization (DG) under the Federated Learning (FL) framework is crucial for addressing challenges related to privacy protection and data heterogeneity. However, traditional federated learning methods fail to account for the imbalance in information aggregation across clients in cross-domain scenarios, leading to the Global Drift (GD) problem and a consequent decline in model generalization performance. This motivates us to delve deeper and define a new critical issue: global drift in federated domain generalization for medical imaging (FedDG-GD). In this paper, we propose a novel tree topology framework called TreeFedDG. First, starting from the distributed characteristics of medical images, we design a hierarchical parameter aggregation method based on a tree-structured topology to suppress deviations in the global model direction. Second, we introduce a parameter difference-based style mixing method (FedStyle), which enforces mixing among clients with maximum parameter differences to enhance robustness against drift. Third, we develop a a progressive personalized fusion strategy during model distribution, ensuring a balance between knowledge transfer and personalized features. Finally, during the inference phase, we use feature similarity to guide the retrieval of the most relevant model chain from the tree structure for ensemble decision-making, thereby fully leveraging the advantages of hierarchical knowledge. We conducted extensive experiments on two publicly available datasets. The results demonstrate that our method outperforms other state-of-the-art domain generalization approaches in these challenging tasks and achieves better balance in cross-domain performance.
☆ Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization ICME2025
Existing 3D human mesh recovery methods often fail to fully exploit the latent information (e.g., human motion, shape alignment), leading to issues with limb misalignment and insufficient local details in the reconstructed human mesh (especially in complex scenes). Furthermore, the performance improvement gained by modelling mesh vertices and pose node interactions using attention mechanisms comes at a high computational cost. To address these issues, we propose a two-stage network for human mesh recovery based on latent information and low dimensional learning. Specifically, the first stage of the network fully excavates global (e.g., the overall shape alignment) and local (e.g., textures, detail) information from the low and high-frequency components of image features and aggregates this information into a hybrid latent frequency domain feature. This strategy effectively extracts latent information. Subsequently, utilizing extracted hybrid latent frequency domain features collaborates to enhance 2D poses to 3D learning. In the second stage, with the assistance of hybrid latent features, we model the interaction learning between the rough 3D human mesh template and the 3D pose, optimizing the pose and shape of the human mesh. Unlike existing mesh pose interaction methods, we design a low-dimensional mesh pose interaction method through dimensionality reduction and parallel optimization that significantly reduces computational costs without sacrificing reconstruction accuracy. Extensive experimental results on large publicly available datasets indicate superiority compared to the most state-of-the-art.
comment: Accepted by ICME2025
☆ From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GPRO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early, identity preservation in the later. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
☆ UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
comment: We have released V1, which only reports the test results. Our work is still ongoing, and the next version will be coming soon
☆ Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery ICME2025
3D human meshes show a natural hierarchical structure (like torso-limbs-fingers). But existing video-based 3D human mesh recovery methods usually learn mesh features in Euclidean space. It's hard to catch this hierarchical structure accurately. So wrong human meshes are reconstructed. To solve this problem, we propose a hyperbolic space learning method leveraging temporal motion prior for recovering 3D human meshes from videos. First, we design a temporal motion prior extraction module. This module extracts the temporal motion features from the input 3D pose sequences and image feature sequences respectively. Then it combines them into the temporal motion prior. In this way, it can strengthen the ability to express features in the temporal motion dimension. Since data representation in non-Euclidean space has been proved to effectively capture hierarchical relationships in real-world datasets (especially in hyperbolic space), we further design a hyperbolic space optimization learning strategy. This strategy uses the temporal motion prior information to assist learning, and uses 3D pose and pose motion information respectively in the hyperbolic space to optimize and learn the mesh features. Then, we combine the optimized results to get an accurate and smooth human mesh. Besides, to make the optimization learning process of human meshes in hyperbolic space stable and effective, we propose a hyperbolic mesh optimization loss. Extensive experimental results on large publicly available datasets indicate superiority in comparison with most state-of-the-art.
comment: Accepted by ICME2025
☆ OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce \textbf{OpenInsGaussian}, an \textbf{Open}-vocabulary \textbf{Ins}tance \textbf{Gaussian} segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
☆ BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
Zero-shot 3D object classification is crucial for real-world applications like autonomous driving, however it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real-world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets -- consisting of a point cloud, image, and text description -- mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5\% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27\%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3\% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on https://github.com/kesu1/BlendCLIP.
comment: Under Review
☆ DeepSeek-OCR: Contexts Optical Compression
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
☆ Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis
This paper presents a generation-based debiasing framework for object detection. Prior debiasing methods are often limited by the representation diversity of samples, while naive generative augmentation often preserves the biases it aims to solve. Moreover, our analysis reveals that simply generating more data for rare classes is suboptimal due to two core issues: i) instance frequency is an incomplete proxy for the true data needs of a model, and ii) current layout-to-image synthesis lacks the fidelity and control to generate high-quality, complex scenes. To overcome this, we introduce the representation score (RS) to diagnose representational gaps beyond mere frequency, guiding the creation of new, unbiased layouts. To ensure high-quality synthesis, we replace ambiguous text prompts with a precise visual blueprint and employ a generative alignment strategy, which fosters communication between the detector and generator. Our method significantly narrows the performance gap for underrepresented object groups, \eg, improving large/rare instances by 4.4/3.6 mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP for layout accuracy in generated images.
☆ DualHash: A Stochastic Primal-Dual Algorithm with Theoretical Guarantee for Deep Hashing
Deep hashing converts high-dimensional feature vectors into compact binary codes, enabling efficient large-scale retrieval. A fundamental challenge in deep hashing stems from the discrete nature of quantization in generating the codes. W-type regularizations, such as $||z|-1|$, have been proven effective as they encourage variables toward binary values. However, existing methods often directly optimize these regularizations without convergence guarantees. While proximal gradient methods offer a promising solution, the coupling between W-type regularizers and neural network outputs results in composite forms that generally lack closed-form proximal solutions. In this paper, we present a stochastic primal-dual hashing algorithm, referred to as DualHash, that provides rigorous complexity bounds. Using Fenchel duality, we partially transform the nonconvex W-type regularization optimization into the dual space, which results in a proximal operator that admits closed-form solutions. We derive two algorithm instances: a momentum-accelerated version with $\mathcal{O}(\varepsilon^{-4})$ complexity and an improved $\mathcal{O}(\varepsilon^{-3})$ version using variance reduction. Experiments on three image retrieval databases demonstrate the superior performance of DualHash.
☆ VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
comment: 10 pages, 5 figures, 4 tables. Under review
☆ EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation
Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound videos remains difficult due to low contrast, probe-induced motion, and heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes well to static images, but its frame-independent design yields unstable predictions and temporal drift in interventional ultrasound. We introduce \textbf{EMA-SAM}, a lightweight extension of SAM-2 that incorporates a confidence-weighted exponential moving average pointer into the memory bank, providing a stable latent prototype of the tumour across frames. This design preserves temporal coherence through probe pressure and bubble occlusion while rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improves \emph{maxDice} from 0.82 (SAM-2) to 0.86 and \emph{maxIoU} from 0.72 to 0.76, while reducing false positives by 29\%. On external benchmarks, including VTUS and colonoscopy video polyp datasets, EMA-SAM achieves consistent gains of 2--5 Dice points over SAM-2. Importantly, the EMA pointer adds \textless0.1\% FLOPs, preserving real-time throughput of $\sim$30\,FPS on a single A100 GPU. These results establish EMA-SAM as a robust and efficient framework for stable tumour tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound. Codes are available here \hyperref[code {https://github.com/mdialameh/EMA-SAM}.
☆ FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo
Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents \emph{FST.ai 2.0}, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates {pose-based action recognition} using graph convolutional networks (GCNs), {epistemic uncertainty modeling} through credal sets, and {explainability overlays} for visual decision support. A set of {interactive dashboards} enables human--AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, FST.ai~2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an {85\% reduction in decision review time} and {93\% referee trust} in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, FST.ai~2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.
comment: 23 pages, 12 figures
☆ A Generalizable Light Transport 3D Embedding for Global Illumination
Global illumination (GI) is essential for realistic rendering but remains computationally expensive due to the complexity of simulating indirect light transport. Recent neural methods have mainly relied on per-scene optimization, sometimes extended to handle changes in camera or geometry. Efforts toward cross-scene generalization have largely stayed in 2D screen space, such as neural denoising or G-buffer based GI prediction, which often suffer from view inconsistency and limited spatial understanding. We propose a generalizable 3D light transport embedding that approximates global illumination directly from 3D scene configurations, without using rasterized or path-traced cues. Each scene is represented as a point cloud with geometric and material features. A scalable transformer models global point-to-point interactions to encode these features into neural primitives. At render time, each query point retrieves nearby primitives via nearest-neighbor search and aggregates their latent features through cross-attention to predict the desired rendering quantity. We demonstrate results on diffuse global illumination prediction across diverse indoor scenes with varying layouts, geometry, and materials. The embedding trained for irradiance estimation can be quickly adapted to new rendering tasks with limited fine-tuning. We also present preliminary results for spatial-directional radiance field estimation for glossy materials and show how the normalized field can accelerate unbiased path guiding. This approach highlights a path toward integrating learned priors into rendering pipelines without explicit ray-traced illumination cues.
☆ RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology
Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline.
☆ VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis
Detecting anomalies in crowded scenes is challenging due to severe inter-person occlusions and highly dynamic, context-dependent motion patterns. Existing approaches often struggle to adapt to varying crowd densities and lack interpretable anomaly indicators. To address these limitations, we introduce VelocityNet, a dual-pipeline framework that combines head detection and dense optical flow to extract person-specific velocities. Hierarchical clustering categorizes these velocities into semantic motion classes (halt, slow, normal, and fast), and a percentile-based anomaly scoring system measures deviations from learned normal patterns. Experiments demonstrate the effectiveness of our framework in real-time detection of diverse anomalous motion patterns within densely crowded environments.
comment: 8 pages, 3 figures
♻ ☆ Glyph: Scaling Context Windows via Visual-Text Compression
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
♻ ☆ PICABench: How Far Are We from Physically Realistic Image Editing?
Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
♻ ☆ CaMiT: A Time-Aware Car Model Dataset for Classification and Generation NeurIPS 2025
AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.
comment: To be published in NeurIPS 2025 Track on Datasets and Benchmarks
♻ ☆ Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization.We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter tuning.Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
♻ ☆ Leveraging AV1 motion vectors for Fast and Dense Feature Matching
We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53,px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
comment: Accepted ICIR 2025, camera-ready version
♻ ☆ Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
comment: 16 pages, 12 figures
♻ ☆ DeepDetect: Learning All-in-One Dense Keypoints
Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model: ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints, that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
comment: 6 pages, 6 figures, 2 tables, 7 equations
♻ ☆ Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing
Parkinson's disease (PD) severity diagnosis is crucial for early detecting potential patients and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients' "masked face" symptom and gains growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on single type of expression which can lead to misdiagnosis, and ignore the class imbalance across different PD stages which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.
comment: 3 pages, 2 figures, accepted by MIND 2025
♻ ☆ LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at https://anonymous.4open.science/r/LongInsightBench-910F/.
comment: Submitted to ARR Rolling Review
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles
comment: 13 pages, 4 figures
♻ ☆ Latent Diffusion Model without Variational Autoencoder
Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.
♻ ☆ Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
♻ ☆ SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mech- anism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
comment: Project page: https://scenecot.github.io/
♻ ☆ PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
♻ ☆ Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
comment: homepage: https://open-bee.github.io/
♻ ☆ Fourier Transform Multiple Instance Learning for Whole Slide Image Classification
Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
♻ ☆ UniVideo: Unified Understanding, Generation, and Editing for Videos
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
comment: Project Website https://congwei1230.github.io/UniVideo/
♻ ☆ H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging
3D landmark detection is a critical task in medical image analysis, and accurately detecting anatomical landmarks is essential for subsequent medical imaging tasks. However, mainstream deep learning methods in this field struggle to simultaneously capture fine-grained local features and model global spatial relationships, while maintaining a balance between accuracy and computational efficiency. Local feature extraction requires capturing fine-grained anatomical details, while global modeling requires understanding the spatial relationships within complex anatomical structures. The high-dimensional nature of 3D volume further exacerbates these challenges, as landmarks are sparsely distributed, leading to significant computational costs. Therefore, achieving efficient and precise 3D landmark detection remains a pressing challenge in medical image analysis. In this work, We propose a \textbf{H}ybrid \textbf{3}D \textbf{DE}tection \textbf{Net}(H3DE-Net), a novel framework that combines CNNs for local feature extraction with a lightweight attention mechanism designed to efficiently capture global dependencies in 3D volumetric data. This mechanism employs a hierarchical routing strategy to reduce computational cost while maintaining global context modeling. To our knowledge, H3DE-Net is the first 3D landmark detection model that integrates such a lightweight attention mechanism with CNNs. Additionally, integrating multi-scale feature fusion further enhances detection accuracy and robustness. Experimental results on a public CT dataset demonstrate that H3DE-Net achieves state-of-the-art(SOTA) performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations. We aready open-source our project, including code, data and model weights.
♻ ☆ Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning NeurIPS 2025
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.
comment: Accepted by NeurIPS 2025
VideoVerse: How Far is Your T2V Generator from a World Model?
The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to build ``world models'', makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which are essential capabilities for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human preference aligned QA-based evaluation pipeline is developed by using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.
comment: 24 Pages, 8 Figures, 11 Tables
♻ ☆ Interpretable Decision-Making for End-to-End Autonomous Driving ICCV 2025
Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.
comment: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of Out-of-Label Hazards in Autonomous Driving (2COOOL)
♻ ☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
comment: 14 pages, 8 figures, project page: https://hcplab-sysu.github.io/TAVP
♻ ☆ SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction
Accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data is crucial for reliable neuroanatomical analyses. Current methods have to contend with complex cortical geometries, strict topological requirements, and often produce surfaces with overlaps, self-intersections, and topological defects. To overcome these shortcomings, we introduce SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted(T1w) MRI volumes while preserving topological properties. Our method first segments the T1w image into a nine-class tissue label map. From these segmentations, we generate subject-specific, collision-free initial surface meshes. These surfaces serve as precise initializations for subsequent multiscale diffeomorphic deformations. Employing stationary velocity fields (SVFs) integrated via scaling-and-squaring, our approach ensures smooth, topology-preserving transformations with significantly reduced surface collisions and self-intersections. Evaluations on standard datasets demonstrate that SimCortex dramatically reduces surface overlaps and self-intersections, surpassing current methods while maintaining state-of-the-art geometric accuracy.
comment: Metadata update: added journal reference and DOI linking to the published chapter (Springer)
♻ ☆ Increasing the Utility of Synthetic Images through Chamfer Guidance NeurIPS 2025
Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
comment: Accepted to NeurIPS 2025
♻ ☆ RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models
Diffusion models have achieved state-of-the-art performance in generative modeling, yet their sampling procedures remain vulnerable to hallucinations-often stemming from inaccuracies in score approximation. In this work, we reinterpret diffusion sampling through the lens of optimization and introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method that detects and corrects high-risk sampling steps using geometric cues from the loss landscape. RODS enforces smoother sampling trajectories and adaptively adjusts perturbations, reducing hallucinations without retraining and at minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands demonstrate that RODS maintains comparable image quality and preserves generation diversity. More importantly, it improves both sampling fidelity and robustness, detecting over 70% of hallucinated samples and correcting more than 25%, all while avoiding the introduction of new artifacts. We release our code at https://github.com/Yiqi-Verna-Tian/RODS.
♻ ☆ Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation NeurIPS 2025
Common computer vision systems typically assume ideal pinhole cameras but fail when facing real-world camera effects such as fisheye distortion and rolling shutter, mainly due to the lack of learning from training data with camera effects. Existing data generation approaches suffer from either high costs, sim-to-real gaps or fail to accurately model camera effects. To address this bottleneck, we propose 4D Gaussian Ray Tracing (4D-GRT), a novel two-stage pipeline that combines 4D Gaussian Splatting with physically-based ray tracing for camera effect simulation. Given multi-view videos, 4D-GRT first reconstructs dynamic scenes, then applies ray tracing to generate videos with controllable, physically accurate camera effects. 4D-GRT achieves the fastest rendering speed while performing better or comparable rendering quality compared to existing baselines. Additionally, we construct eight synthetic dynamic scenes in indoor environments across four camera effects as a benchmark to evaluate generated videos with camera effects.
comment: Paper accepted to NeurIPS 2025 Workshop SpaVLE. Project page: https://shigon255.github.io/4DGRT-project-page/
♻ ☆ VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baselines method across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
comment: Project page: https://faceong.github.io/VIKI-R/
♻ ☆ Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles
Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for navigating indoor and hard-to-reach areas, yet their significant constraints in payload and autonomy have largely prevented their use for complex tasks like high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we introduce a novel system architecture that enables fully autonomous, high-fidelity 3D scanning of static objects using UAVs weighing under 100 grams. Our core innovation lies in a dual-reconstruction pipeline that creates a real-time feedback loop between data capture and flight control. A near-real-time (near-RT) process uses Structure from Motion (SfM) to generate an instantaneous pointcloud of the object. The system analyzes the model quality on the fly and dynamically adapts the UAV's trajectory to intelligently capture new images of poorly covered areas. This ensures comprehensive data acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR) approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB) location data to achieve superior accuracy. We implemented and validated this architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both single- and multi-UAV configurations, conclusively show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths. This work demonstrates a scalable and autonomous solution that unlocks the potential of miniaturized UAVs for fine-grained 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.
comment: 13 pages, 16 figures, 3 tables, 45 references
♻ ☆ Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning
Medical Vision Foundation Models (Med-VFMs) have superior capabilities of interpreting medical images due to the knowledge learned from self-supervised pre-training with extensive unannotated images. To improve their performance on adaptive downstream evaluations, especially segmentation, a few samples from target domains are selected randomly for fine-tuning them. However, there lacks works to explore the way of adapting Med-VFMs to achieve the optimal performance on target domains efficiently. Thus, it is highly demanded to design an efficient way of fine-tuning Med-VFMs by selecting informative samples to maximize their adaptation performance on target domains. To achieve this, we propose an Active Source-Free Domain Adaptation (ASFDA) method to efficiently adapt Med-VFMs to target domains for volumetric medical image segmentation. This ASFDA employs a novel Active Learning (AL) method to select the most informative samples from target domains for fine-tuning Med-VFMs without the access to source pre-training samples, thus maximizing their performance with the minimal selection budget. In this AL method, we design an Active Test Time Sample Query strategy to select samples from the target domains via two query metrics, including Diversified Knowledge Divergence (DKD) and Anatomical Segmentation Difficulty (ASD). DKD is designed to measure the source-target knowledge gap and intra-domain diversity. It utilizes the knowledge of pre-training to guide the querying of source-dissimilar and semantic-diverse samples from the target domains. ASD is designed to evaluate the difficulty in segmentation of anatomical structures by measuring predictive entropy from foreground regions adaptively. Additionally, our ASFDA method employs a Selective Semi-supervised Fine-tuning to improve the performance and efficiency of fine-tuning by identifying samples with high reliability from unqueried ones.
comment: 17 pages, 5 figures, 8 tables
♻ ☆ Deep Learning in Palmprint Recognition-A Comprehensive Survey
Palmprint recognition has emerged as a prominent biometric technology, widely applied in diverse scenarios. Traditional handcrafted methods for palmprint recognition often fall short in representation capability, as they heavily depend on researchers' prior knowledge. Deep learning (DL) has been introduced to address this limitation, leveraging its remarkable successes across various domains. While existing surveys focus narrowly on specific tasks within palmprint recognition-often grounded in traditional methodologies-there remains a significant gap in comprehensive research exploring DL-based approaches across all facets of palmprint recognition. This paper bridges that gap by thoroughly reviewing recent advancements in DL-powered palmprint recognition. The paper systematically examines progress across key tasks, including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges. Beyond highlighting these advancements, the paper identifies current challenges and uncovers promising opportunities for future research. By consolidating state-of-the-art progress, this review serves as a valuable resource for researchers, enabling them to stay abreast of cutting-edge technologies and drive innovation in palmprint recognition.
comment: Palmprint recognition, biometrics, deep learning, feature extraction, recognition tasks
♻ ☆ UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning NeurIPS 2025
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
comment: NeurIPS 2025 Camera Ready. Project Page: https://polyu-chenlab.github.io/unipixel/
♻ ☆ Regression is all you need for medical image translation
While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved impressive results in natural image synthesis, their core strengths - creativity and realism - can be detrimental in medical applications, where accuracy and fidelity are paramount. These models instead risk introducing hallucinations and replication of unwanted acquisition noise. Here, we propose YODA (You Only Denoise once - or Average), a 2.5D diffusion-based framework for medical image translation (MIT). Consistent with DM theory, we find that conventional diffusion sampling stochastically replicates noise. To mitigate this, we draw and average multiple samples, akin to physical signal averaging. As this effectively approximates the DM's expected value, we term this Expectation-Approximation (ExpA) sampling. We additionally propose regression sampling YODA, which retains the initial DM prediction and omits iterative refinement to produce noise-free images in a single step. Across five diverse multi-modal datasets - including multi-contrast brain MRI and pelvic MRI-CT - we demonstrate that regression sampling is not only substantially more efficient but also matches or exceeds image quality of full diffusion sampling even with ExpA. Our results reveal that iterative refinement solely enhances perceptual realism without benefiting information translation, which we confirm in relevant downstream tasks. YODA outperforms eight state-of-the-art DMs and GANs and challenges the presumed superiority of DMs and GANs over computationally cheap regression models for high-quality MIT. Furthermore, we show that YODA-translated images are interchangeable with, or even superior to, physical acquisitions for several medical applications.
♻ ☆ ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text
Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating textual descriptions from both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
comment: Accepted by PRCV 2025
♻ ☆ Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation NeurIPS 2025
Diffusion-based inverse algorithms have shown remarkable performance across various inverse problems, yet their reliance on numerous denoising steps incurs high computational costs. While recent developments of fast diffusion ODE solvers offer effective acceleration for diffusion sampling without observations, their application in inverse problems remains limited due to the heterogeneous formulations of inverse algorithms and their prevalent use of approximations and heuristics, which often introduce significant errors that undermine the reliability of analytical solvers. In this work, we begin with an analysis of ODE solvers for inverse problems that reveals a linear combination structure of approximations for the inverse trajectory. Building on this insight, we propose a canonical form that unifies a broad class of diffusion-based inverse algorithms and facilitates the design of more generalizable solvers. Inspired by the linear subspace search strategy, we propose Learnable Linear Extrapolation (LLE), a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm conforming to our canonical form. LLE optimizes the combination coefficients to refine current predictions using previous estimates, alleviating the sensitivity of analytical solvers for inverse algorithms. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
comment: Accepted by NeurIPS 2025
♻ ☆ Pose-free 3D Gaussian splatting via shape-ray estimation ICIP 2025
While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities by joint shape and camera rays estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting. Code is avilable at https://github.com/youngju-na/SHARE
comment: ICIP 2025 (Best Student Paper Award) Code available at: https://github.com/youngju-na/SHARE
♻ ☆ Mask Image Watermarking NeurIPS
We present MaskWM, a simple, efficient, and flexible framework for image watermarking. MaskWM has two variants: (1) MaskWM-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection; (2) MaskWM-ED, which focuses on local watermark embedding and extraction, offering enhanced robustness in small regions to support fine-grined image protection. MaskWM-D builds on the classical encoder-distortion layer-decoder training paradigm. In MaskWM-D, we introduce a simple masking mechanism during the decoding stage that enables both global and local watermark extraction. During training, the decoder is guided by various types of masks applied to watermarked images before extraction, helping it learn to localize watermarks and extract them from the corresponding local areas. MaskWM-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions, which improves robustness under regional attacks. Extensive experiments show that MaskWM achieves state-of-the-art performance in global and local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images. In addition, MaskWM is highly efficient and adaptable. It requires only 20 hours of training on a single A6000 GPU, achieving 15x computational efficiency compared to WAM. By simply adjusting the distortion layer, MaskWM can be quickly fine-tuned to meet varying robustness requirements.
comment: Neural Information Processing Systems (NeurIPS) 2025
♻ ☆ scSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy NeurIPS 2025
Fluorescence microscopy, while being a key driver for progress in the life sciences, is also subject to technical limitations. To overcome them, computational multiplexing techniques have recently been proposed, which allow multiple cellular structures to be captured in a single image and later be unmixed. Existing image decomposition methods are trained on a set of superimposed input images and the respective unmixed target images. It is critical to note that the relative strength (mixing ratio) of the superimposed images for a given input is a priori unknown. However, existing methods are trained on a fixed intensity ratio of superimposed inputs, making them not cognizant of the range of relative intensities that can occur in fluorescence microscopy. In this work, we propose a novel method called scSplit that is cognizant of the severity of the above-mentioned mixing ratio. Our idea is based on InDI , a popular iterative method for image restoration, and an ideal starting point to embrace the unknown mixing ratio in any given input. We introduce (i) a suitably trained regressor network that predicts the degradation level (mixing ratio) of a given input image and (ii) a degradation-specific normalization module, enabling degradation-aware inference across all mixing ratios. We show that this method solves two relevant tasks in fluorescence microscopy, namely image splitting and bleedthrough removal, and empirically demonstrate the applicability of scSplit on 5 public datasets. The source code with pre-trained models is hosted at https://github.com/juglab/scSplit/.
comment: manuscript accepted at NeurIPS 2025
♻ ☆ A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson's r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson's r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
comment: Paper accepted to Human Brain Mapping. 25 pages, 3 figures, 8 tables
♻ ☆ REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
♻ ☆ VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching NeurIPS 2025
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity in robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby circumventing redundant computations. Additionally, to maintain action precision, VLA-Cache selectively re-computes task-relevant tokens that are environmentally sensitive, ensuring the fidelity of critical visual information. To further optimize efficiency, we introduce a layer adaptive token reusing strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss on task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
comment: Accepted to NeurIPS 2025
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computation has not been thoroughly explored in the context of image quality assessment (IQA), a task depending critically on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
♻ ☆ MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at https://github.com/mbzuai-oryx/MATRIX.
comment: We have come across a recent approach that has not been properly attributed at the time of submission and compared in a fair setting. Therefore, we would like to withdraw the paper to address these concerns
♻ ☆ ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model NeurIPS2025
In real-word scenarios, person re-identification (ReID) expects to identify a person-of-interest via the descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. This dataset also has significant superiority in terms of diversity, such as the painting perspectives and textual information. It could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism proposed. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.
comment: NeurIPS2025 Accepted Paper
♻ ☆ Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model
3D medical image analysis is essential for modern healthcare, yet traditional task-specific models are inadequate due to limited generalizability across diverse clinical scenarios. Multimodal large language models (MLLMs) offer a promising solution to these challenges. However, existing MLLMs have limitations in fully leveraging the rich, hierarchical information embedded in 3D medical images. Inspired by clinical practice, where radiologists focus on both 3D spatial structure and 2D planar content, we propose Med-2E3, a 3D medical MLLM that integrates a dual 3D-2D encoder architecture. To aggregate 2D features effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which scores the attention of each 2D slice based on slice contents and task instructions. To the best of our knowledge, Med-2E3 is the first MLLM to integrate both 3D and 2D features for 3D medical image analysis. Experiments on large-scale, open-source 3D medical multimodal datasets demonstrate that TG-IS exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models. The code is available at: https://github.com/MSIIP/Med-2E3
♻ ☆ GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction NeurIPS 2025
Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Think With Videos For Agentic Long-Video Understanding
Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).
♻ ☆ DA$^2$: Depth Anything in Any Direction
Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data has be released. Project page: https://depth-any-in-any-dir.github.io/.
comment: Work primarily done during an internship at Tencent Hunyuan. Project page: https://depth-any-in-any-dir.github.io/
♻ ☆ H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting
Dynamic scene reconstruction poses a persistent challenge in 3D vision. Deformable 3D Gaussian Splatting has emerged as an effective method for this task, offering real-time rendering and high visual fidelity. This approach decomposes a dynamic scene into a static representation in a canonical space and time-varying scene motion. Scene motion is defined as the collective movement of all Gaussian points, and for compactness, existing approaches commonly adopt implicit neural fields or sparse control points. However, these methods predominantly rely on gradient-based optimization for all motion information. Due to the high degree of freedom, they struggle to converge on real-world datasets exhibiting complex motion. To preserve the compactness of motion representation and address convergence challenges, this paper proposes heterogeneous 3D control points, termed \textbf{H3D control points}, whose attributes are obtained using a hybrid strategy combining optical flow back-projection and gradient-based methods. This design decouples directly observable motion components from those that are geometrically occluded. Specifically, components of 3D motion that project onto the image plane are directly acquired via optical flow back projection, while unobservable portions are refined through gradient-based optimization. Experiments on the Neu3DV and CMU-Panoptic datasets demonstrate that our method achieves superior performance over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably, our method converges within just 100 iterations and achieves a per-frame processing speed of 2 seconds on a single NVIDIA RTX 4070 GPU.
♻ ☆ WMamba: Wavelet-based Mamba for Face Forgery Detection ACM MM 2025
The rapid evolution of deepfake generation technologies necessitates the development of robust face forgery detection algorithms. Recent studies have demonstrated that wavelet analysis can enhance the generalization abilities of forgery detectors. Wavelets effectively capture key facial contours, often slender, fine-grained, and globally distributed, that may conceal subtle forgery artifacts imperceptible in the spatial domain. However, current wavelet-based approaches fail to fully exploit the distinctive properties of wavelet data, resulting in sub-optimal feature extraction and limited performance gains. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear complexity. This efficiency allows for the extraction of fine-grained, globally distributed forgery artifacts from small image patches. Extensive experiments show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness in face forgery detection.
comment: Accepted by ACM MM 2025
♻ ☆ Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable NeurIPS 2025
Existing detectors are often trained on biased datasets, leading to the possibility of overfitting on non-causal image attributes that are spuriously correlated with real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when applied to unbiased datasets. One common solution is to perform dataset alignment through generative reconstruction, matching the semantic content between real and synthetic images. However, we revisit this approach and show that pixel-level alignment alone is insufficient. The reconstructed images still suffer from frequency-level misalignment, which can perpetuate spurious correlations. To illustrate, we observe that reconstruction models tend to restore the high-frequency details lost in real images (possibly due to JPEG compression), inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images for testing detector performance on the most aligned dataset, and EvalGEN, featuring the latest generative models for assessing detectors under new generative architectures such as visual auto-regressive generators. Finally, our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO could improve across 8 diverse benchmarks by a non-trivial margin, showing a +7.2% on in-the-wild benchmarks, highlighting the improved generalizability of unbiased detectors. Our code is available at: https://github.com/roy-ch/Dual-Data-Alignment.
comment: NeurIPS 2025 Spotlight. 13 Pages, 10 figures
♻ ☆ gen2seg: Generative Models Enable Generalizable Instance Segmentation
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
comment: Website: https://reachomk.github.io/gen2seg/
♻ ☆ Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning
Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
comment: Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025
♻ ☆ From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes NeurIPS 2025
3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best performance model, OpenAI o4-mini, achieves only 23.00% accuracy on space-level tasks and 31.46% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.
comment: NeurIPS 2025 (Datasets and Benchmarks). Project page: https://anywhere-3d.github.io/
Polyline Path Masked Attention for Vision Transformer
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
♻ ☆ Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge NeurIPS 2025
Facial personalization faces challenges to maintain identity fidelity without disrupting the foundation model's prompt consistency. The mainstream personalization models employ identity embedding to integrate identity information within the attention mechanisms. However, our preliminary findings reveal that identity embeddings compromise the effectiveness of other tokens in the prompt, thereby limiting high prompt consistency and attribute-level controllability. Moreover, by deactivating identity embedding, personalization models still demonstrate the underlying foundation models' ability to control facial attributes precisely. It suggests that such foundation models' knowledge can be leveraged to cure the ill-aligned prompt consistency of personalization models. Building upon these insights, we propose FreeCure, a framework that improves the prompt consistency of personalization models with their latent foundation models' knowledge. First, by setting a dual inference paradigm with/without identity embedding, we identify attributes (e.g., hair, accessories, etc.) for enhancements. Second, we introduce a novel foundation-aware self-attention module, coupled with an inversion-based process to bring well-aligned attribute information to the personalization process. Our approach is training-free, and can effectively enhance a wide array of facial attributes; and it can be seamlessly integrated into existing popular personalization models based on both Stable Diffusion and FLUX. FreeCure has consistently shown significant improvements in prompt consistency across these facial personalization models while maintaining the integrity of their original identity fidelity.
comment: Accepted to NeurIPS 2025
♻ ☆ Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning NeurIPS'25
Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the Global Prompt Refinement with Non-Interfering Attention Masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.
comment: NeurIPS'25 accepted
♻ ☆ Exploring Cross-Modal Flows for Few-Shot Learning
Aligning features from different modalities, is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
comment: 13 pages, 6 figures
♻ ☆ CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
The sparse cross-modality detector offers more advantages than its counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify that the geometric structure preserved and the class distribution are the key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core module of SS is Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. Thereby, outperforming other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned module into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84 faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.
comment: 13 pages
♻ ☆ Class-wise Balancing Data Replay for Federated Class-Incremental Learning NeurIPS'25
Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior task in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) Subsequently, to handle class imbalance across tasks, the task aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model's overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
comment: NeurIPS'25 Accepted, Oral
♻ ☆ When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.
comment: 2nd version update to Jun.2025
♻ ☆ Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield ''visible but unreadable'' stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
comment: Agent4Science 2025 Spotlight
♻ ☆ Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM -- by prompting the model to produce a reasoning chain before providing an answer -- can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
♻ ☆ Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation
The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.
♻ ☆ MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce {MSR-Align}, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
♻ ☆ RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and an open-loop gap. In this work, we propose RAD, a 3DGS-based closed-loop Reinforcement Learning (RL) framework for end-to-end Autonomous Driving. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards to guide the policy in effectively responding to safety-critical events and understanding real-world causal relationships. To better align with human driving behavior, we incorporate IL into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, particularly exhibiting a 3x lower collision rate. Abundant closed-loop results are presented in the supplementary material. Code is available at https://github.com/hustvl/RAD for facilitating future research.
comment: Code: https://github.com/hustvl/RAD
♻ ☆ SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
comment: 22 pages,15 figures
♻ ☆ Learning Collaborative Knowledge with Multimodal Representation for Polyp Re-Identification
Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Worsely, these solutions typically learn unimodal modal representations on the basis of visual samples, which fails to explore complementary information from other different modalities. To address this challenge, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification, which can effectively encourage multimodal knowledge collaboration and reinforce generalization capability in medical scenarios. On the basis of it, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized visual-text representations for multimodal fusion via end-to-end training. Experiments on the standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the collaborative multimodal fusion strategy. The code is publicly available at https://github.com/JeremyXSC/DMCL.
♻ ☆ GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization
Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, a first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected over thousands voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task. GeoArena has been open-sourced to support future research.
♻ ☆ A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking NeurIPS 2025
The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel and unified framework for video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: https://vfbench.github.io.
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection AAAI 25
View transformation robustness (VTR) is critical for deep-learning-based multi-view 3D object reconstruction models, which indicates the methods' stability under inputs with various view transformations. However, existing research seldom focused on view transformation robustness in multi-view 3D object reconstruction. One direct way to improve the models' VTR is to produce data with more view transformations and add them to model training. Recent progress on large vision models, particularly Stable Diffusion models, has provided great potential for generating 3D models or synthesizing novel view images with only a single image input. Directly deploying these models at inference consumes heavy computation resources and their robustness to view transformations is not guaranteed either. To fully utilize the power of Stable Diffusion models without extra inference computation burdens, we propose to generate novel views with Stable Diffusion models for better view transformation robustness. Instead of synthesizing random views, we propose a reconstruction error-guided view selection method, which considers the reconstruction errors' spatial distribution of the 3D predictions and chooses the views that could cover the reconstruction errors as much as possible. The methods are trained and tested on sets with large view transformations to validate the 3D reconstruction models' robustness to view transformations. Extensive experiments demonstrate that the proposed method can outperform state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods. Code is available at: https://github.com/zqyq/VTR.
comment: AAAI 25
♻ ☆ ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction
Bed-related falls remain a major source of injury in hospitals and long-term care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only one low-cost load cell mounted under a bed leg. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps-recurrence plot, Markov transition field, and Gramian angular field-that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.
♻ ☆ Implicit Neural Compression of Point Clouds
Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.
♻ ☆ LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. LLM-RG processes an image and a free-form referring expression by using an LLM to extract relevant object types and attributes, detecting candidate regions, generating rich visual descriptors with a VLM, and then combining these descriptors with spatial metadata into natural-language prompts that are input to an LLM for chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
comment: Human-aware Embodied AI Workshop @ IROS 2025
♻ ☆ NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?
The evaluation of Vision-Language-Action (VLA) agents is hindered by the coarse, end-task success metric that fails to provide precise skill diagnosis or measure robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce NEBULA, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained capability tests for precise skill diagnosis with systematic stress tests that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, which are consistently obscured by conventional end-task success metrics. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for robust, general-purpose embodied agents.
comment: Homepage: https://vulab-ai.github.io/NEBULA-Alpha/
♻ ☆ SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
comment: Project page: https://scenecot.github.io/
Machine Learning 227
☆ Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
☆ Lyapunov-Aware Quantum-Inspired Reinforcement Learning for Continuous-Time Vehicle Control: A Feasibility Study
This paper presents a novel Lyapunov-Based Quantum Reinforcement Learning (LQRL) framework that integrates quantum policy optimization with Lyapunov stability analysis for continuous-time vehicle control. The proposed approach combines the representational power of variational quantum circuits (VQCs) with a stability-aware policy gradient mechanism to ensure asymptotic convergence and safe decision-making under dynamic environments. The vehicle longitudinal control problem was formulated as a continuous-state reinforcement learning task, where the quantum policy network generates control actions subject to Lyapunov stability constraints. Simulation experiments were conducted in a closed-loop adaptive cruise control scenario using a quantum-inspired policy trained under stability feedback. The results demonstrate that the LQRL framework successfully embeds Lyapunov stability verification into quantum policy learning, enabling interpretable and stability-aware control performance. Although transient overshoot and Lyapunov divergence were observed under aggressive acceleration, the system maintained bounded state evolution, validating the feasibility of integrating safety guarantees within quantum reinforcement learning architectures. The proposed framework provides a foundational step toward provably safe quantum control in autonomous systems and hybrid quantum-classical optimization domains.
comment: 7 pages, 4 figures, 20 equations, 3 appendices, 4 tables
☆ A Hybrid Enumeration Framework for Optimal Counterfactual Generation in Post-Acute COVID-19 Heart Failure
Counterfactual inference provides a mathematical framework for reasoning about hypothetical outcomes under alternative interventions, bridging causal reasoning and predictive modeling. We present a counterfactual inference framework for individualized risk estimation and intervention analysis, illustrated through a clinical application to post-acute sequelae of COVID-19 (PASC) among patients with pre-existing heart failure (HF). Using longitudinal diagnosis, laboratory, and medication data from a large health-system cohort, we integrate regularized predictive modeling with counterfactual search to identify actionable pathways to PASC-related HF hospital admissions. The framework combines exact enumeration with optimization-based methods, including the Nearest Instance Counterfactual Explanations (NICE) and Multi-Objective Counterfactuals (MOC) algorithms, to efficiently explore high-dimensional intervention spaces. Applied to more than 2700 individuals with confirmed SARS-CoV-2 infection and prior HF, the model achieved strong discriminative performance (AUROC: 0.88, 95% CI: 0.84-0.91) and generated interpretable, patient-specific counterfactuals that quantify how modifying comorbidity patterns or treatment factors could alter predicted outcomes. This work demonstrates how counterfactual reasoning can be formalized as an optimization problem over predictive functions, offering a rigorous, interpretable, and computationally efficient approach to personalized inference in complex biomedical systems.
☆ MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts-especially in distributed settings-remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.
☆ Actor-Free Continuous Control via Structurally Maximizable Q-Functions NeurIPS 2025
Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic's output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at https://github.com/USC-Lira/Q3C.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ SO(3)-invariant PCA with application to molecular data
Principal component analysis (PCA) is a fundamental technique for dimensionality reduction and denoising; however, its application to three-dimensional data with arbitrary orientations -- common in structural biology -- presents significant challenges. A naive approach requires augmenting the dataset with many rotated copies of each sample, incurring prohibitive computational costs. In this paper, we extend PCA to 3D volumetric datasets with unknown orientations by developing an efficient and principled framework for SO(3)-invariant PCA that implicitly accounts for all rotations without explicit data augmentation. By exploiting underlying algebraic structure, we demonstrate that the computation involves only the square root of the total number of covariance entries, resulting in a substantial reduction in complexity. We validate the method on real-world molecular datasets, demonstrating its effectiveness and opening up new possibilities for large-scale, high-dimensional reconstruction problems.
☆ BO4Mob: Bayesian Optimization Benchmarks for High-Dimensional Urban Mobility Problem
We introduce \textbf{BO4Mob}, a new benchmark framework for high-dimensional Bayesian Optimization (BO), driven by the challenge of origin-destination (OD) travel demand estimation in large urban road networks. Estimating OD travel demand from limited traffic sensor data is a difficult inverse optimization problem, particularly in real-world, large-scale transportation networks. This problem involves optimizing over high-dimensional continuous spaces where each objective evaluation is computationally expensive, stochastic, and non-differentiable. BO4Mob comprises five scenarios based on real-world San Jose, CA road networks, with input dimensions scaling up to 10,100. These scenarios utilize high-resolution, open-source traffic simulations that incorporate realistic nonlinear and stochastic dynamics. We demonstrate the benchmark's utility by evaluating five optimization methods: three state-of-the-art BO algorithms and two non-BO baselines. This benchmark is designed to support both the development of scalable optimization algorithms and their application for the design of data-driven urban mobility models, including high-resolution digital twins of metropolitan road networks. Code and documentation are available at https://github.com/UMN-Choi-Lab/BO4Mob.
☆ Search Self-play: Pushing the Frontier of Agent Capability without Supervision
Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human efforts and hinders the RL scaling processes, especially under agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer's trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.
☆ Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards
We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
☆ A Unified Perspective on Optimization in Machine Learning and Neuroscience: From Gradient Descent to Neural Adaptation
Iterative optimization is central to modern artificial intelligence (AI) and provides a crucial framework for understanding adaptive systems. This review provides a unified perspective on this subject, bridging classic theory with neural network training and biological learning. Although gradient-based methods, powered by the efficient but biologically implausible backpropagation (BP), dominate machine learning, their computational demands can hinder scalability in high-dimensional settings. In contrast, derivative-free or zeroth-order (ZO) optimization feature computationally lighter approaches that rely only on function evaluations and randomness. While generally less sample efficient, recent breakthroughs demonstrate that modern ZO methods can effectively approximate gradients and achieve performance competitive with BP in neural network models. This ZO paradigm is also particularly relevant for biology. Its core principles of random exploration (probing) and feedback-guided adaptation (reinforcing) parallel key mechanisms of biological learning, offering a mathematically principled perspective on how the brain learns. In this review, we begin by categorizing optimization approaches based on the order of derivative information they utilize, ranging from first-, second-, and higher-order gradient-based to ZO methods. We then explore how these methods are adapted to the unique challenges of neural network training and the resulting learning dynamics. Finally, we build upon these insights to view biological learning through an optimization lens, arguing that a ZO paradigm leverages the brain's intrinsic noise as a computational resource. This framework not only illuminates our understanding of natural intelligence but also holds vast implications for neuromorphic hardware, helping us design fast and energy-efficient AI systems that exploit intrinsic hardware noise.
☆ When LRP Diverges from Leave-One-Out in Transformers
Leave-One-Out (LOO) provides an intuitive measure of feature importance but is computationally prohibitive. While Layer-Wise Relevance Propagation (LRP) offers a potentially efficient alternative, its axiomatic soundness in modern Transformers remains largely under-examined. In this work, we first show that the bilinear propagation rules used in recent advances of AttnLRP violate the implementation invariance axiom. We prove this analytically and confirm it empirically in linear attention layers. Second, we also revisit CP-LRP as a diagnostic baseline and find that bypassing relevance propagation through the softmax layer -- backpropagating relevance only through the value matrices -- significantly improves alignment with LOO, particularly in middle-to-late Transformer layers. Overall, our results suggest that (i) bilinear factorization sensitivity and (ii) softmax propagation error potentially jointly undermine LRP's ability to approximate LOO in Transformers.
comment: BlackboxNLP @ EMNLP 2025
☆ On Biologically Plausible Learning in Continuous Time
Biological learning unfolds continuously in time, yet most algorithmic models rely on discrete updates and separate inference and learning phases. We study a continuous-time neural model that unifies several biologically plausible learning algorithms and removes the need for phase separation. Rules including stochastic gradient descent (SGD), feedback alignment (FA), direct feedback alignment (DFA), and Kolen-Pollack (KP) emerge naturally as limiting cases of the dynamics. Simulations show that these continuous-time networks stably learn at biological timescales, even under temporal mismatches and integration noise. Through analysis and simulation, we show that learning depends on temporal overlap: a synapse updates correctly only when its input and the corresponding error signal coincide in time. When inputs are held constant, learning strength declines linearly as the delay between input and error approaches the stimulus duration, explaining observed robustness and failure across network depths. Critically, robust learning requires the synaptic plasticity timescale to exceed the stimulus duration by one to two orders of magnitude. For typical cortical stimuli (tens of milliseconds), this places the functional plasticity window in the few-second range, a testable prediction that identifies seconds-scale eligibility traces as necessary for error-driven learning in biological circuits.
☆ Decoding Funded Research: Comparative Analysis of Topic Models and Uncovering the Effect of Gender and Geographic Location
Optimizing national scientific investment requires a clear understanding of evolving research trends and the demographic and geographical forces shaping them, particularly in light of commitments to equity, diversity, and inclusion. This study addresses this need by analyzing 18 years (2005-2022) of research proposals funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). We conducted a comprehensive comparative evaluation of three topic modelling approaches: Latent Dirichlet Allocation (LDA), Structural Topic Modelling (STM), and BERTopic. We also introduced a novel algorithm, named COFFEE, designed to enable robust covariate effect estimation for BERTopic. This advancement addresses a significant gap, as BERTopic lacks a native function for covariate analysis, unlike the probabilistic STM. Our findings highlight that while all models effectively delineate core scientific domains, BERTopic outperformed by consistently identifying more granular, coherent, and emergent themes, such as the rapid expansion of artificial intelligence. Additionally, the covariate analysis, powered by COFFEE, confirmed distinct provincial research specializations and revealed consistent gender-based thematic patterns across various scientific disciplines. These insights offer a robust empirical foundation for funding organizations to formulate more equitable and impactful funding strategies, thereby enhancing the effectiveness of the scientific ecosystem.
comment: 35 pages
☆ Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams
Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
comment: Under review
☆ CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Despite significant work on low-bit quantization-aware training (QAT), there is still a large accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics. When pre-training Llama-style models of up to 800M-parameters, CAGE recovers over 10% of the quantization-induced loss increase in the W4A4 regime over outlier-mitigation methods. These results indicate that curvature-aware gradient corrections can bridge the remaining performance gap beyond current outlier-handling methods.
☆ Enhancing Fractional Gradient Descent with Learned Optimizers
Fractional Gradient Descent (FGD) offers a novel and promising way to accelerate optimization by incorporating fractional calculus into machine learning. Although FGD has shown encouraging initial results across various optimization tasks, it faces significant challenges with convergence behavior and hyperparameter selection. Moreover, the impact of its hyperparameters is not fully understood, and scheduling them is particularly difficult in non-convex settings such as neural network training. To address these issues, we propose a novel approach called Learning to Optimize Caputo Fractional Gradient Descent (L2O-CFGD), which meta-learns how to dynamically tune the hyperparameters of Caputo FGD (CFGD). Our method's meta-learned schedule outperforms CFGD with static hyperparameters found through an extensive search and, in some tasks, achieves performance comparable to a fully black-box meta-learned optimizer. L2O-CFGD can thus serve as a powerful tool for researchers to identify high-performing hyperparameters and gain insights on how to leverage the history-dependence of the fractional differential in optimization.
☆ A Frequentist Statistical Introduction to Variational Inference, Autoencoders, and Diffusion Models
While Variational Inference (VI) is central to modern generative models like Variational Autoencoders (VAEs) and Denoising Diffusion Models (DDMs), its pedagogical treatment is split across disciplines. In statistics, VI is typically framed as a Bayesian method for posterior approximation. In machine learning, however, VAEs and DDMs are developed from a Frequentist viewpoint, where VI is used to approximate a maximum likelihood estimator. This creates a barrier for statisticians, as the principles behind VAEs and DDMs are hard to contextualize without a corresponding Frequentist introduction to VI. This paper provides that introduction: we explain the theory for VI, VAEs, and DDMs from a purely Frequentist perspective, starting with the classical Expectation-Maximization (EM) algorithm. We show how VI arises as a scalable solution for intractable E-steps and how VAEs and DDMs are natural, deep-learning-based extensions of this framework, thereby bridging the gap between classical statistical inference and modern generative AI.
comment: This is an introduction paper. 28 pages, 2 figures
☆ Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference
Causal inference is essential for developing and evaluating medical interventions, yet real-world medical datasets are often difficult to access due to regulatory barriers. This makes synthetic data a potentially valuable asset that enables these medical analyses, along with the development of new inference methods themselves. Generative models can produce synthetic data that closely approximate real data distributions, yet existing methods do not consider the unique challenges that downstream causal inference tasks, and specifically those focused on treatments, pose. We establish a set of desiderata that synthetic data containing treatments should satisfy to maximise downstream utility: preservation of (i) the covariate distribution, (ii) the treatment assignment mechanism, and (iii) the outcome generation mechanism. Based on these desiderata, we propose a set of evaluation metrics to assess such synthetic data. Finally, we present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine that mimics the data-generating process of data containing treatments and optimises for our desiderata. We empirically demonstrate that STEAM achieves state-of-the-art performance across our metrics as compared to existing generative models, particularly as the complexity of the true data-generating process increases.
☆ Symbolic Emulators for Cosmology: Accelerating Cosmological Analyses Without Sacrificing Precision
In cosmology, emulators play a crucial role by providing fast and accurate predictions of complex physical models, enabling efficient exploration of high-dimensional parameter spaces that would be computationally prohibitive with direct numerical simulations. Symbolic emulators have emerged as promising alternatives to numerical approaches, delivering comparable accuracy with significantly faster evaluation times. While previous symbolic emulators were limited to relatively narrow prior ranges, we expand these to cover the parameter space relevant for current cosmological analyses. We introduce approximations to hypergeometric functions used for the $\Lambda$CDM comoving distance and linear growth factor which are accurate to better than 0.001% and 0.05%, respectively, for all redshifts and for $\Omega_{\rm m} \in [0.1, 0.5]$. We show that integrating symbolic emulators into a Dark Energy Survey-like $3\times2$pt analysis produces cosmological constraints consistent with those obtained using standard numerical methods. Our symbolic emulators offer substantial improvements in speed and memory usage, demonstrating their practical potential for scalable, likelihood-based inference.
comment: 22 pages, 6 figures. Invited contribution for the Royal Society Philosophical Transactions A special issue "Symbolic regression in the physical sciences"
Diffusion Buffer for Online Generative Speech Enhancement
Online Speech Enhancement was mainly reserved for predictive models. A key advantage of these models is that for an incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative Speech Enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based Speech Enhancement model which only requires one neural network call per incoming signal frame from a stream of data and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with Diffusion time-steps. The approach progressively denoises frames through physical time, where past frames have more noise removed. Consequently, an enhanced frame is output to the listener with a delay defined by the Diffusion Buffer, and the output frame has a corresponding look-ahead. In this work, we extend upon our previous work by carefully designing a 2D convolutional UNet architecture that specifically aligns with the Diffusion Buffer's look-ahead. We observe that the proposed UNet improves performance, particularly when the algorithmic latency is low. Moreover, we show that using a Data Prediction loss instead of Denoising Score Matching loss enables flexible control over the trade-off between algorithmic latency and quality during inference. The extended Diffusion Buffer equipped with a novel NN and loss function drastically reduces the algorithmic latency from 320 - 960 ms to 32 - 176 ms with an even increased performance. While it has been shown before that offline generative diffusion models outperform predictive approaches in unseen noisy speech data, we confirm that the online Diffusion Buffer also outperforms its predictive counterpart on unseen noisy speech data.
☆ Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation
Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
☆ Adapting Language Balance in Code-Switching Speech
Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation -- the central challenge in code-switching -- thereby improving the model's robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.
comment: Submitted to ICASSP 2026
☆ Bayesian Low-Rank Factorization for Robust Model Adaptation
Large speech foundation models achieve strong performance across many domains, but they often require adaptation to handle local needs such as code-switching, where speakers mix languages within the same utterance. Direct fine-tuning of these models risks overfitting to the target domain and overwriting the broad capabilities of the base model. To address this challenge, we explore Bayesian factorized adapters for speech foundation models, which place priors near zero to achieve sparser adaptation matrices and thereby retain general performance while adapting to specific domains. We apply our approach to the Whisper model and evaluate on different multilingual code-switching scenarios. Our results show only minimal adaptation loss while significantly reducing catastrophic forgetting of the base model. Compared to LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new domain. These findings highlight the effectiveness of Bayesian adaptation for fine-tuning speech foundation models without sacrificing generalization.
comment: Submitted to ICASSP 2026
☆ Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options NeurIPS 2025
We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged-motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs)-most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve-and can even deteriorate-as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega \left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
comment: Accepted at NeurIPS 2025
☆ Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach
Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman-Jensen Gap} analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.
☆ Learning Task-Agnostic Representations through Multi-Teacher Distillation NeurIPS-2025
Casting complex inputs into tractable representations is a critical step across various fields. Diverse embedding models emerge from differences in architectures, loss functions, input modalities and datasets, each capturing unique aspects of the input. Multi-teacher distillation leverages this diversity to enrich representations but often remains tailored to specific tasks. In this paper, we introduce a task-agnostic framework based on a ``majority vote" objective function. We demonstrate that this function is bounded by the mutual information between student and teachers' embeddings, leading to a task-agnostic distillation loss that eliminates dependence on task-specific labels or prior knowledge. Our evaluations across text, vision models, and molecular modeling show that our method effectively leverages teacher diversity, resulting in representations enabling better performance for a wide range of downstream tasks such as classification, clustering, or regression. Additionally, we train and release state-of-the-art embedding models, enhancing downstream performance in various modalities.
comment: NeurIPS-2025
☆ Reasoning Language Model Inference Serving Unveiled: An Empirical Study
The reasoning large language model (RLLM) has been proven competitive in solving complex reasoning tasks such as mathematics, coding, compared to general LLM. However, the serving performance and behavior of RLLM remains unexplored, which may undermine the deployment and utilization of RLLM in real-world scenario. To close this gap, in this paper, we conduct a comprehensive study of RLLM service. We first perform a pilot study on comparing the serving performance between RLLM and traditional LLM and reveal that there are several distinct differences regarding serving behavior: (1) significant memory usage and fluctuations; (2) straggler requests; (3) adaptive running time; (4) domain preference. Then we further investigate whether existing inference optimization techniques are valid for RLLM. Our main takeaways are that model quantization methods and speculative decoding can improve service system efficiency with small compromise to RLLM accuracy, while prefix caching, KV cache quantization may even degrade accuracy or serving performance for small RLLM. Lastly, we conduct evaluation under real world workload modeled by Gamma distribution to verify our findings. Empirical results of real world workload evaluation across different dataset are aligned with our main findings regarding RLLM serving. We hope our work can provide the research community and industry with insights to advance RLLM inference serving.
☆ Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches
The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
comment: Submitted to the IEEE Journal of Biomedical And Health Informatics
☆ Differentially Private E-Values
E-values have gained prominence as flexible tools for statistical inference and risk control, enabling anytime- and post-hoc-valid procedures under minimal assumptions. However, many real-world applications fundamentally rely on sensitive data, which can be leaked through e-values. To ensure their safe release, we propose a general framework to transform non-private e-values into differentially private ones. Towards this end, we develop a novel biased multiplicative noise mechanism that ensures our e-values remain statistically valid. We show that our differentially private e-values attain strong statistical power, and are asymptotically as powerful as their non-private counterparts. Experiments across online risk monitoring, private healthcare, and conformal e-prediction demonstrate our approach's effectiveness and illustrate its broad applicability.
☆ Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression NeurIPS 2025
This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2\% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
comment: Accepted to NeurIPS 2025
☆ Learning Time-Varying Turn-Taking Behavior in Group Conversations
We propose a flexible probabilistic model for predicting turn-taking patterns in group conversations based solely on individual characteristics and past speaking behavior. Many models of conversation dynamics cannot yield insights that generalize beyond a single group. Moreover, past works often aim to characterize speaking behavior through a universal formulation that may not be suitable for all groups. We thus develop a generalization of prior conversation models that predicts speaking turns among individuals in any group based on their individual characteristics, that is, personality traits, and prior speaking behavior. Importantly, our approach provides the novel ability to learn how speaking inclination varies based on when individuals last spoke. We apply our model to synthetic and real-world conversation data to verify the proposed approach and characterize real group interactions. Our results demonstrate that previous behavioral models may not always be realistic, motivating our data-driven yet theoretically grounded approach.
☆ Informed Learning for Estimating Drought Stress at Fine-Scale Resolution Enables Accurate Yield Prediction
Water is essential for agricultural productivity. Assessing water shortages and reduced yield potential is a critical factor in decision-making for ensuring agricultural productivity and food security. Crop simulation models, which align with physical processes, offer intrinsic explainability but often perform poorly. Conversely, machine learning models for crop yield modeling are powerful and scalable, yet they commonly operate as black boxes and lack adherence to the physical principles of crop growth. This study bridges this gap by coupling the advantages of both worlds. We postulate that the crop yield is inherently defined by the water availability. Therefore, we formulate crop yield as a function of temporal water scarcity and predict both the crop drought stress and the sensitivity to water scarcity at fine-scale resolution. Sequentially modeling the crop yield response to water enables accurate yield prediction. To enforce physical consistency, a novel physics-informed loss function is proposed. We leverage multispectral satellite imagery, meteorological data, and fine-scale yield data. Further, to account for the uncertainty within the model, we build upon a deep ensemble approach. Our method surpasses state-of-the-art models like LSTM and Transformers in crop yield prediction with a coefficient of determination ($R^2$-score) of up to 0.82 while offering high explainability. This method offers decision support for industry, policymakers, and farmers in building a more resilient agriculture in times of changing climate conditions.
☆ Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions NeurIPS 2025
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
comment: NeurIPS 2025
☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
☆ C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression BMVC2025
Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that prompts sparsity in model structures, e.g. weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolution neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
comment: 10 pages, BMVC2025
☆ Hardness of Learning Regular Languages in the Next Symbol Prediction Setting
We study the learnability of languages in the Next Symbol Prediction (NSP) setting, where a learner receives only positive examples from a language together with, for every prefix, (i) whether the prefix itself is in the language and (ii) which next symbols can lead to an accepting string. This setting has been used in prior works to empirically analyze neural sequence models, and additionally, we observe that efficient algorithms for the NSP setting can be used to learn the (truncated) support of language models. We formalize the setting so as to make it amenable to PAC-learning analysis. While the setting provides a much richer set of labels than the conventional classification setting, we show that learning concept classes such as DFAs and Boolean formulas remains computationally hard. The proof is via a construction that makes almost all additional labels uninformative, yielding a reduction from the conventional learning problem to learning with NSP labels. Under cryptographic assumptions, the reduction implies that the problem of learning DFAs is computationally hard in the NSP setting.
comment: 7 pages
☆ A Rectification-Based Approach for Distilling Boosted Trees into Decision Trees
We present a new approach for distilling boosted trees into decision trees, in the objective of generating an ML model offering an acceptable compromise in terms of predictive performance and interpretability. We explain how the correction approach called rectification can be used to implement such a distillation process. We show empirically that this approach provides interesting results, in comparison with an approach to distillation achieved by retraining the model.
comment: 29 pages
☆ Unrolled-SINDy: A Stable Explicit Method for Non linear PDE Discovery from Sparsely Sampled Data
Identifying from observation data the governing differential equations of a physical dynamics is a key challenge in machine learning. Although approaches based on SINDy have shown great promise in this area, they still fail to address a whole class of real world problems where the data is sparsely sampled in time. In this article, we introduce Unrolled-SINDy, a simple methodology that leverages an unrolling scheme to improve the stability of explicit methods for PDE discovery. By decorrelating the numerical time step size from the sampling rate of the available data, our approach enables the recovery of equation parameters that would not be the minimizers of the original SINDy optimization problem due to large local truncation errors. Our method can be exploited either through an iterative closed-form approach or by a gradient descent scheme. Experiments show the versatility of our method. On both traditional SINDy and state-of-the-art noise-robust iNeuralSINDy, with different numerical schemes (Euler, RK4), our proposed unrolling scheme allows to tackle problems not accessible to non-unrolled methods.
comment: 56 pages, 12 figures, 39 tables
☆ A Compositional Paradigm for Foundation Models: Towards Smarter Robotic Agents
The birth of Foundation Models brought unprecedented results in a wide range of tasks, from language to vision, to robotic control. These models are able to process huge quantities of data, and can extract and develop rich representations, which can be employed across different domains and modalities. However, they still have issues in adapting to dynamic, real-world scenarios without retraining the entire model from scratch. In this work, we propose the application of Continual Learning and Compositionality principles to foster the development of more flexible, efficient and smart AI solutions.
☆ Channel-Aware Vector Quantization for Robust Semantic Communication on Discrete Channels
Deep learning-based semantic communication has largely relied on analog or semi-digital transmission, which limits compatibility with modern digital communication infrastructures. Recent studies have employed vector quantization (VQ) to enable discrete semantic transmission, yet existing methods neglect channel state information during codebook optimization, leading to suboptimal robustness. To bridge this gap, we propose a channel-aware vector quantization (CAVQ) algorithm within a joint source-channel coding (JSCC) framework, termed VQJSCC, established on a discrete memoryless channel. In this framework, semantic features are discretized and directly mapped to modulation constellation symbols, while CAVQ integrates channel transition probabilities into the quantization process, aligning easily confused symbols with semantically similar codewords. A multi-codebook alignment mechanism is further introduced to handle mismatches between codebook order and modulation order by decomposing the transmission stream into multiple independently optimized subchannels. Experimental results demonstrate that VQJSCC effectively mitigates the digital cliff effect, achieves superior reconstruction quality across various modulation schemes, and outperforms state-of-the-art digital semantic communication baselines in both robustness and efficiency.
comment: 12 pages, 8 figures
☆ Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing
Graph neural networks (GNNs) are the predominant architecture for learning over graphs. As with any machine learning model, and important issue is the detection of adversarial attacks, where an adversary can change the output with a small perturbation of the input. Techniques for solving the adversarial robustness problem - determining whether such an attack exists - were originally developed for image classification, but there are variants for many other machine learning architectures. In the case of graph learning, the attack model usually considers changes to the graph structure in addition to or instead of the numerical features of the input, and the state of the art techniques in the area proceed via reduction to constraint solving, working on top of powerful solvers, e.g. for mixed integer programming. We show that it is possible to improve on the state of the art in structural robustness by replacing the use of powerful solvers by calls to efficient partial solvers, which run in polynomial time but may be incomplete. We evaluate our tool RobLight on a diverse set of GNN variants and datasets.
☆ CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder NeurIPS 2025
Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and update only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
comment: NeurIPS 2025
☆ HeFS: Helper-Enhanced Feature Selection via Pareto-Optimized Genetic Search
Feature selection is a combinatorial optimization problem that is NP-hard. Conventional approaches often employ heuristic or greedy strategies, which are prone to premature convergence and may fail to capture subtle yet informative features. This limitation becomes especially critical in high-dimensional datasets, where complex and interdependent feature relationships prevail. We introduce the HeFS (Helper-Enhanced Feature Selection) framework to refine feature subsets produced by existing algorithms. HeFS systematically searches the residual feature space to identify a Helper Set - features that complement the original subset and improve classification performance. The approach employs a biased initialization scheme and a ratio-guided mutation mechanism within a genetic algorithm, coupled with Pareto-based multi-objective optimization to jointly maximize predictive accuracy and feature complementarity. Experiments on 18 benchmark datasets demonstrate that HeFS consistently identifies overlooked yet informative features and achieves superior performance over state-of-the-art methods, including in challenging domains such as gastric cancer classification, drug toxicity prediction, and computer science applications. The code and datasets are available at https://healthinformaticslab.org/supp/.
☆ A Multi-Evidence Framework Rescues Low- Power Prognostic Signals and Rejects Statistical Artifacts in Cancer Genomics
Motivation: Standard genome-wide association studies in cancer genomics rely on statistical significance with multiple testing correction, but systematically fail in underpowered cohorts. In TCGA breast cancer (n=967, 133 deaths), low event rates (13.8%) create severe power limitations, producing false negatives for known drivers and false positives for large passenger genes. Results: We developed a five-criteria computational framework integrating causal inference (inverse probability weighting, doubly robust estimation) with orthogonal biological validation (expression, mutation patterns, literature evidence). Applied to TCGA-BRCA mortality analysis, standard Cox+FDR detected zero genes at FDR<0.05, confirming complete failure in underpowered settings. Our framework correctly identified RYR2 - a cardiac gene with no cancer function - as a false positive despite nominal significance (p=0.024), while identifying KMT2C as a complex candidate requiring validation despite marginal significance (p=0.047, q=0.954). Power analysis revealed median power of 15.1% across genes, with KMT2C achieving only 29.8% power (HR=1.55), explaining borderline statistical significance despite strong biological evidence. The framework distinguished true signals from artifacts through mutation pattern analysis: RYR2 showed 29.8% silent mutations (passenger signature) with no hotspots, while KMT2C showed 6.7% silent mutations with 31.4% truncating variants (driver signature). This multi-evidence approach provides a template for analyzing underpowered cohorts, prioritizing biological interpretability over purely statistical significance. Availability: All code and analysis pipelines available at github.com/akarlaraytu/causal- inference-for-cancer-genomics
comment: 17 pages (main text), 4 figures (main text), 7 supplementary figures, 4 supplementary tables. Focuses on a computational framework using causal inference and biological validation for underpowered cancer genomic studies
☆ RAISE: A Unified Framework for Responsible AI Scoring and Evaluation
As AI systems enter high-stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability. We introduce RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance across these four dimensions and aggregates them into a single, holistic Responsibility Score. We evaluated three deep learning models: a Multilayer Perceptron (MLP), a Tabular ResNet, and a Feature Tokenizer Transformer, on structured datasets from finance, healthcare, and socioeconomics. Our findings reveal critical trade-offs: the MLP demonstrated strong sustainability and robustness, the Transformer excelled in explainability and fairness at a very high environmental cost, and the Tabular ResNet offered a balanced profile. These results underscore that no single model dominates across all responsibility criteria, highlighting the necessity of multi-dimensional evaluation for responsible model selection. Our implementation is available at: https://github.com/raise-framework/raise.
comment: Accepted at the 26th International Conference on Principles and Practice of Multi-Agent Systems
☆ Interval Prediction of Annual Average Daily Traffic on Local Roads via Quantile Random Forest with High-Dimensional Spatial Data
Accurate annual average daily traffic (AADT) data are vital for transport planning and infrastructure management. However, automatic traffic detectors across national road networks often provide incomplete coverage, leading to underrepresentation of minor roads. While recent machine learning advances have improved AADT estimation at unmeasured locations, most models produce only point predictions and overlook estimation uncertainty. This study addresses that gap by introducing an interval prediction approach that explicitly quantifies predictive uncertainty. We integrate a Quantile Random Forest model with Principal Component Analysis to generate AADT prediction intervals, providing plausible traffic ranges bounded by estimated minima and maxima. Using data from over 2,000 minor roads in England and Wales, and evaluated with specialized interval metrics, the proposed method achieves an interval coverage probability of 88.22%, a normalized average width of 0.23, and a Winkler Score of 7,468.47. By combining machine learning with spatial and high-dimensional analysis, this framework enhances both the accuracy and interpretability of AADT estimation, supporting more robust and informed transport planning.
☆ Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
☆ Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models
As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH). In psychological and social value theories such as Schwartz's Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially those underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model (SCM) to feature complex interdependency and prioritization among features, as well as the causal relationship between high-level value dimensions and behaviors. Moreover, it applies counterfactual reasoning to generate outputs aligned with any desired value objectives. Benefitting from explicit causal modeling, COUPLE also provides better interpretability. We evaluate COUPLE on two datasets with different value systems and demonstrate that COUPLE advances other baselines across diverse types of value objectives.
comment: 41 pages, 7 figures
☆ Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints
The ROC curve is widely used to assess binary classification performance. Yet for some applications such as alert systems for hospitalized patient monitoring, conventional ROC analysis cannot capture crucial factors that impact deployment, such as enforcing a minimum precision constraint to avoid false alarm fatigue or imposing an upper bound on the number of predicted positives to represent the capacity of hospital staff. The usual area under the curve metric also does not reflect asymmetric costs for false positives and false negatives. In this paper we address all three of these issues. First, we show how the subset of classifiers that meet given precision and capacity constraints can be represented as a feasible region in ROC space. We establish the geometry of this feasible region. We then define the partial area of lesser classifiers, a performance metric that is monotonic with cost and only accounts for the feasible portion of ROC space. Averaging this area over a desired range of cost parameters results in the partial volume over the ROC surface, or partial VOROS. In experiments predicting mortality risk using vital sign history on the MIMIC-IV dataset, we show this cost-aware metric is better than alternatives for ranking classifiers in hospital alert applications.
☆ Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware SSL
Self-supervised learning (SSL) holds a great deal of promise for applications in neuroscience, due to the lack of large-scale, consistently labeled neural datasets. However, most neural datasets contain heterogeneous populations that mix stable, predictable cells with highly stochastic, stimulus-contingent ones, which has made it hard to identify consistent activity patterns during SSL. As a result, self-supervised pretraining has yet to show clear signs of benefits from scale on neural data. Here, we present a novel approach to self-supervised pretraining, POYO-SSL that exploits the heterogeneity of neural data to improve pre-training and achieve benefits of scale. Specifically, in POYO-SSL we pretrain only on predictable (statistically regular) neurons-identified on the pretraining split via simple higher-order statistics (skewness and kurtosis)-then we fine-tune on the unpredictable population for downstream tasks. On the Allen Brain Observatory dataset, this strategy yields approximately 12-13% relative gains over from-scratch training and exhibits smooth, monotonic scaling with model size. In contrast, existing state-of-the-art baselines plateau or destabilize as model size increases. By making predictability an explicit metric for crafting the data diet, POYO-SSL turns heterogeneity from a liability into an asset, providing a robust, biologically grounded recipe for scalable neural decoding and a path toward foundation models of neural dynamics.
☆ Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
comment: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
☆ Alibaba International E-commerce Product Search Competition DILAB Team Technical Report CIKM
This study presents the multilingual e-commerce search system developed by the DILAB team, which achieved 5th place on the final leaderboard with a competitive overall score of 0.8819, demonstrating stable and high-performing results across evaluation metrics. To address challenges in multilingual query-item understanding, we designed a multi-stage pipeline integrating data refinement, lightweight preprocessing, and adaptive modeling. The data refinement stage enhanced dataset consistency and category coverage, while language tagging and noise filtering improved input quality. In the modeling phase, multiple architectures and fine-tuning strategies were explored, and hyperparameters optimized using curated validation sets to balance performance across query-category (QC) and query-item (QI) tasks. The proposed framework exhibited robustness and adaptability across languages and domains, highlighting the effectiveness of systematic data curation and iterative evaluation for multilingual search systems. The source code is available at https://github.com/2noweyh/DILAB-Alibaba-Ecommerce-Search.
comment: CIKM Alibaba E-commerce Search Challenge 2025
☆ Learning to Navigate Under Imperfect Perception: Conformalised Segmentation for Safe Reinforcement Learning
Reliable navigation in safety-critical environments requires both accurate hazard perception and principled uncertainty handling to strengthen downstream safety handling. Despite the effectiveness of existing approaches, they assume perfect hazard detection capabilities, while uncertainty-aware perception approaches lack finite-sample guarantees. We present COPPOL, a conformal-driven perception-to-policy learning approach that integrates distribution-free, finite-sample safety guarantees into semantic segmentation, yielding calibrated hazard maps with rigorous bounds for missed detections. These maps induce risk-aware cost fields for downstream RL planning. Across two satellite-derived benchmarks, COPPOL increases hazard coverage (up to 6x) compared to comparative baselines, achieving near-complete detection of unsafe regions while reducing hazardous violations during navigation (up to approx 50%). More importantly, our approach remains robust to distributional shift, preserving both safety and efficiency.
☆ Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation
Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.
☆ Benchmarking Fairness-aware Graph Neural Networks in Knowledge Graphs
Graph neural networks (GNNs) are powerful tools for learning from graph-structured data but often produce biased predictions with respect to sensitive attributes. Fairness-aware GNNs have been actively studied for mitigating biased predictions. However, no prior studies have evaluated fairness-aware GNNs on knowledge graphs, which are one of the most important graphs in many applications, such as recommender systems. Therefore, we introduce a benchmarking study on knowledge graphs. We generate new graphs from three knowledge graphs, YAGO, DBpedia, and Wikidata, that are significantly larger than the existing graph datasets used in fairness studies. We benchmark inprocessing and preprocessing methods in different GNN backbones and early stopping conditions. We find several key insights: (i) knowledge graphs show different trends from existing datasets; clearer trade-offs between prediction accuracy and fairness metrics than other graphs in fairness-aware GNNs, (ii) the performance is largely affected by not only fairness-aware GNN methods but also GNN backbones and early stopping conditions, and (iii) preprocessing methods often improve fairness metrics, while inprocessing methods improve prediction accuracy.
☆ Simple and Efficient Heterogeneous Temporal Graph Neural Network
Heterogeneous temporal graphs (HTGs) are ubiquitous data structures in the real world. Recently, to enhance representation learning on HTGs, numerous attention-based neural networks have been proposed. Despite these successes, existing methods rely on a decoupled temporal and spatial learning paradigm, which weakens interactions of spatio-temporal information and leads to a high model complexity. To bridge this gap, we propose a novel learning paradigm for HTGs called Simple and Efficient Heterogeneous Temporal Graph N}eural Network (SE-HTGNN). Specifically, we innovatively integrate temporal modeling into spatial learning via a novel dynamic attention mechanism, which retains attention information from historical graph snapshots to guide subsequent attention computation, thereby improving the overall discriminative representations learning of HTGs. Additionally, to comprehensively and adaptively understand HTGs, we leverage large language models to prompt SE-HTGNN, enabling the model to capture the implicit properties of node types as prior knowledge. Extensive experiments demonstrate that SE-HTGNN achieves up to 10x speed-up over the state-of-the-art and latest baseline while maintaining the best forecasting accuracy.
comment: Accepted by Neurips 2025
☆ Learning Boltzmann Generators via Constrained Mass Transport
Efficient sampling from high-dimensional and multimodal unnormalized probability distributions is a central challenge in many areas of science and machine learning. We focus on Boltzmann generators (BGs) that aim to sample the Boltzmann distribution of physical systems, such as molecules, at a given temperature. Classical variational approaches that minimize the reverse Kullback-Leibler divergence are prone to mode collapse, while annealing-based methods, commonly using geometric schedules, can suffer from mass teleportation and rely heavily on schedule tuning. We introduce Constrained Mass Transport (CMT), a variational framework that generates intermediate distributions under constraints on both the KL divergence and the entropy decay between successive steps. These constraints enhance distributional overlap, mitigate mass teleportation, and counteract premature convergence. Across standard BG benchmarks and the here introduced ELIL tetrapeptide, the largest system studied to date without access to samples from molecular dynamics, CMT consistently surpasses state-of-the-art variational methods, achieving more than 2.5x higher effective sample size while avoiding mode collapse.
☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: Code and models available at: https://github.com/tianciB/VFM-VAE
☆ A machine learning approach to automation and uncertainty evaluation for self-validating thermocouples
Thermocouples are in widespread use in industry, but they are particularly susceptible to calibration drift in harsh environments. Self-validating thermocouples aim to address this issue by using a miniature phase-change cell (fixed-point) in close proximity to the measurement junction (tip) of the thermocouple. The fixed point is a crucible containing an ingot of metal with a known melting temperature. When the process temperature being monitored passes through the melting temperature of the ingot, the thermocouple output exhibits a "plateau" during melting. Since the melting temperature of the ingot is known, the thermocouple can be recalibrated in situ. Identifying the melting plateau to determine the onset of melting is reasonably well established but requires manual intervention involving zooming in on the region around the actual melting temperature, a process which can depend on the shape of the melting plateau. For the first time, we present a novel machine learning approach to recognize and identify the characteristic shape of the melting plateau and once identified, to quantity the point at which melting begins, along with its associated uncertainty. This removes the need for human intervention in locating and characterizing the melting point. Results from test data provided by CCPI Europe show 100% accuracy of melting plateau detection. They also show a cross-validated R2 of 0.99 on predictions of calibration drift.
comment: 6 pages, 7 figures. TEMPERATURE: ITS MEASUREMENT AND CONTROL IN SCIENCE AND INDUSTRY, VOLUME 9: Proceedings of the Tenth International Temperature Symposium 3-7 April 2023 Anaheim, USA
☆ Provable Generalization Bounds for Deep Neural Networks with Adaptive Regularization
Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop's effectiveness, we derive a tightened PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 20% sharper bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop outperforms baseline regularization techniques, including standard dropout and adaptive gradient regularization, by 1-2% in test accuracy on MNIST (99.52%) and CIFAR-10 (90.63%), with generalization gaps of 0.48% and 7.14%, respectively. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization suitable for high-stakes applications.
comment: 8 pages
☆ Learning from N-Tuple Data with M Positive Instances: Unbiased Risk Estimation and Theoretical Guarantees
Weakly supervised learning often operates with coarse aggregate signals rather than instance labels. We study a setting where each training example is an $n$-tuple containing exactly m positives, while only the count m per tuple is observed. This NTMP (N-tuple with M positives) supervision arises in, e.g., image classification with region proposals and multi-instance measurements. We show that tuple counts admit a trainable unbiased risk estimator (URE) by linking the tuple-generation process to latent instance marginals. Starting from fixed (n,m), we derive a closed-form URE and extend it to variable tuple sizes, variable counts, and their combination. Identification holds whenever the effective mixing rate is separated from the class prior. We establish generalization bounds via Rademacher complexity and prove statistical consistency with standard rates under mild regularity assumptions. To improve finite-sample stability, we introduce simple ReLU corrections to the URE that preserve asymptotic correctness. Across benchmarks converted to NTMP tasks, the approach consistently outperforms representative weak-supervision baselines and yields favorable precision-recall and F1 trade-offs. It remains robust under class-prior imbalance and across diverse tuple configurations, demonstrating that count-only supervision can be exploited effectively through a theoretically grounded and practically stable objective.
☆ Approximation Rates of Shallow Neural Networks: Barron Spaces, Activation Functions and Optimality Analysis
This paper investigates the approximation properties of shallow neural networks with activation functions that are powers of exponential functions. It focuses on the dependence of the approximation rate on the dimension and the smoothness of the function being approximated within the Barron function space. We examine the approximation rates of ReLU$^{k}$ activation functions, proving that the optimal rate cannot be achieved under $\ell^{1}$-bounded coefficients or insufficient smoothness conditions. We also establish optimal approximation rates in various norms for functions in Barron spaces and Sobolev spaces, confirming the curse of dimensionality. Our results clarify the limits of shallow neural networks' approximation capabilities and offer insights into the selection of activation functions and network structures.
☆ S2AP: Score-space Sharpness Minimization for Adversarial Pruning
Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
☆ Training Diverse Graph Experts for Ensembles: A Systematic Empirical Study
Graph Neural Networks (GNNs) have become essential tools for learning on relational data, yet the performance of a single GNN is often limited by the heterogeneity present in real-world graphs. Recent advances in Mixture-of-Experts (MoE) frameworks demonstrate that assembling multiple, explicitly diverse GNNs with distinct generalization patterns can significantly improve performance. In this work, we present the first systematic empirical study of expert-level diversification techniques for GNN ensembles. Evaluating 20 diversification strategies -- including random re-initialization, hyperparameter tuning, architectural variation, directionality modeling, and training data partitioning -- across 14 node classification benchmarks, we construct and analyze over 200 ensemble variants. Our comprehensive evaluation examines each technique in terms of expert diversity, complementarity, and ensemble performance. We also uncovers mechanistic insights into training maximally diverse experts. These findings provide actionable guidance for expert training and the design of effective MoE frameworks on graph data. Our code is available at https://github.com/Hydrapse/bench-gnn-diversification.
☆ Towards Unsupervised Open-Set Graph Domain Adaptation via Dual Reprogramming NeurIPS 2025
Unsupervised Graph Domain Adaptation has become a promising paradigm for transferring knowledge from a fully labeled source graph to an unlabeled target graph. Existing graph domain adaptation models primarily focus on the closed-set setting, where the source and target domains share the same label spaces. However, this assumption might not be practical in the real-world scenarios, as the target domain might include classes that are not present in the source domain. In this paper, we investigate the problem of unsupervised open-set graph domain adaptation, where the goal is to not only correctly classify target nodes into the known classes, but also recognize previously unseen node types into the unknown class. Towards this end, we propose a novel framework called GraphRTA, which conducts reprogramming on both the graph and model sides. Specifically, we reprogram the graph by modifying target graph structure and node features, which facilitates better separation of known and unknown classes. Meanwhile, we also perform model reprogramming by pruning domain-specific parameters to reduce bias towards the source graph while preserving parameters that capture transferable patterns across graphs. Additionally, we extend the classifier with an extra dimension for the unknown class, thus eliminating the need of manually specified threshold in open-set recognition. Comprehensive experiments on several public datasets demonstrate that our proposed model can achieve satisfied performance compared with recent state-of-the-art baselines. Our source codes and datasets are publicly available at https://github.com/cszhangzhen/GraphRTA.
comment: Accepted by NeurIPS 2025
☆ Learning to Flow from Generative Pretext Tasks for Neural Architecture Encoding NeurIPS 2025
The performance of a deep learning model on a specific task and dataset depends heavily on its neural architecture, motivating considerable efforts to rapidly and accurately identify architectures suited to the target task and dataset. To achieve this, researchers use machine learning models-typically neural architecture encoders-to predict the performance of a neural architecture. Many state-of-the-art encoders aim to capture information flow within a neural architecture, which reflects how information moves through the forward pass and backpropagation, via a specialized model structure. However, due to their complicated structures, these flow-based encoders are significantly slower to process neural architectures compared to simpler encoders, presenting a notable practical challenge. To address this, we propose FGP, a novel pre-training method for neural architecture encoding that trains an encoder to capture the information flow without requiring specialized model structures. FGP trains an encoder to reconstruct a flow surrogate, our proposed representation of the neural architecture's information flow. Our experiments show that FGP boosts encoder performance by up to 106% in Precision-1%, compared to the same encoder trained solely with supervised learning.
comment: Published as a conference paper at NeurIPS 2025
☆ Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state of the art methods, even without requiring additional training.
☆ Computable universal online learning NeurIPS 2025
Understanding when learning is possible is a fundamental task in the theory of machine learning. However, many characterizations known from the literature deal with abstract learning as a mathematical object and ignore the crucial question: when can learning be implemented as a computer program? We address this question for universal online learning, a generalist theoretical model of online binary classification, recently characterized by Bousquet et al. (STOC'21). In this model, there is no hypothesis fixed in advance; instead, Adversary -- playing the role of Nature -- can change their mind as long as local consistency with the given class of hypotheses is maintained. We require Learner to achieve a finite number of mistakes while using a strategy that can be implemented as a computer program. We show that universal online learning does not imply computable universal online learning, even if the class of hypotheses is relatively easy from a computability-theoretic perspective. We then study the agnostic variant of computable universal online learning and provide an exact characterization of classes that are learnable in this sense. We also consider a variant of proper universal online learning and show exactly when it is possible. Together, our results give a more realistic perspective on the existing theory of online binary classification and the related problem of inductive inference.
comment: Accepted for presentation at NeurIPS 2025
☆ PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion
State-of-the-art perceptive Reinforcement Learning controllers for legged robots either (i) impose oscillator or IK-based gait priors that constrain the action space, add bias to the policy optimization and reduce adaptability across robot morphologies, or (ii) operate "blind", which struggle to anticipate hind-leg terrain, and are brittle to noise. In this paper, we propose Phase-Guided Terrain Traversal (PGTT), a perception-aware deep-RL approach that overcomes these limitations by enforcing gait structure purely through reward shaping, thereby reducing inductive bias in policy learning compared to oscillator/IK-conditioned action priors. PGTT encodes per-leg phase as a cubic Hermite spline that adapts swing height to local heightmap statistics and adds a swing- phase contact penalty, while the policy acts directly in joint space supporting morphology-agnostic deployment. Trained in MuJoCo (MJX) on procedurally generated stair-like terrains with curriculum and domain randomization, PGTT achieves the highest success under push disturbances (median +7.5% vs. the next best method) and on discrete obstacles (+9%), with comparable velocity tracking, and converging to an effective policy roughly 2x faster than strong end-to-end baselines. We validate PGTT on a Unitree Go2 using a real-time LiDAR elevation-to-heightmap pipeline, and we report preliminary results on ANYmal-C obtained with the same hyperparameters. These findings indicate that terrain-adaptive, phase-guided reward shaping is a simple and general mechanism for robust perceptive locomotion across platforms.
comment: 9 pages, 9 figures, 2 tables
☆ Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs
The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $\gamma < 1$. In contrast, however, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $\gamma = 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $\gamma = 1$) can be replaced with a new object that we call the transient visitation measure.
☆ ECG-LLM-- training and evaluation of domain-specific large language models for electrocardiography
Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.
comment: 34 pages, 8 figures, code available at https://github.com/AI4HealthUOL/ecg-llm
☆ Parametrising the Inhomogeneity Inducing Capacity of a Training Set, and its Impact on Supervised Learning
We introduce parametrisation of that property of the available training dataset, that necessitates an inhomogeneous correlation structure for the function that is learnt as a model of the relationship between the pair of variables, observations of which comprise the considered training data. We refer to a parametrisation of this property of a given training set, as its ``inhomogeneity parameter''. It is easy to compute this parameter for small-to-large datasets, and we demonstrate such computation on multiple publicly-available datasets, while also demonstrating that conventional ``non-stationarity'' of data does not imply a non-zero inhomogeneity parameter of the dataset. We prove that - within the probabilistic Gaussian Process-based learning approach - a training set with a non-zero inhomogeneity parameter renders it imperative, that the process that is invoked to model the sought function, be non-stationary. Following the learning of a real-world multivariate function with such a Process, quality and reliability of predictions at test inputs, are demonstrated to be affected by the inhomogeneity parameter of the training data.
☆ Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching NeurIPS 2025
We introduce Time-Conditioned Contraction Matching (TCCM), a novel method for semi-supervised anomaly detection in tabular data. TCCM is inspired by flow matching, a recent generative modeling framework that learns velocity fields between probability distributions and has shown strong performance compared to diffusion models and generative adversarial networks. Instead of directly applying flow matching as originally formulated, TCCM builds on its core idea -- learning velocity fields between distributions -- but simplifies the framework by predicting a time-conditioned contraction vector toward a fixed target (the origin) at each sampled time step. This design offers three key advantages: (1) a lightweight and scalable training objective that removes the need for solving ordinary differential equations during training and inference; (2) an efficient scoring strategy called one time-step deviation, which quantifies deviation from expected contraction behavior in a single forward pass, addressing the inference bottleneck of existing continuous-time models such as DTE (a diffusion-based model with leading anomaly detection accuracy but heavy inference cost); and (3) explainability and provable robustness, as the learned velocity field operates directly in input space, making the anomaly score inherently feature-wise attributable; moreover, the score function is Lipschitz-continuous with respect to the input, providing theoretical guarantees under small perturbations. Extensive experiments on the ADBench benchmark show that TCCM strikes a favorable balance between detection accuracy and inference cost, outperforming state-of-the-art methods -- especially on high-dimensional and large-scale datasets. The source code is available at our GitHub repository.
comment: Paper accepted by NeurIPS 2025
☆ Uncertainty Estimation by Flexible Evidential Deep Learning NeurIPS 2025
Uncertainty quantification (UQ) is crucial for deploying machine learning models in high-stakes applications, where overconfident predictions can lead to serious consequences. An effective UQ method must balance computational efficiency with the ability to generalize across diverse scenarios. Evidential deep learning (EDL) achieves efficiency by modeling uncertainty through the prediction of a Dirichlet distribution over class probabilities. However, the restrictive assumption of Dirichlet-distributed class probabilities limits EDL's robustness, particularly in complex or unforeseen situations. To address this, we propose \textit{flexible evidential deep learning} ($\mathcal{F}$-EDL), which extends EDL by predicting a flexible Dirichlet distribution -- a generalization of the Dirichlet distribution -- over class probabilities. This approach provides a more expressive and adaptive representation of uncertainty, significantly enhancing UQ generalization and reliability under challenging scenarios. We theoretically establish several advantages of $\mathcal{F}$-EDL and empirically demonstrate its state-of-the-art UQ performance across diverse evaluation settings, including classical, long-tailed, and noisy in-distribution scenarios.
comment: NeurIPS 2025
☆ MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Imitation learning from large-scale, diverse human demonstrations has proven effective for training robots, but collecting such data is costly and time-consuming. This challenge is amplified for multi-step bimanual mobile manipulation, where humans must teleoperate both a mobile base and two high-degree-of-freedom arms. Prior automated data generation frameworks have addressed static bimanual manipulation by augmenting a few human demonstrations in simulation, but they fall short for mobile settings due to two key challenges: (1) determining base placement to ensure reachability, and (2) positioning the camera to provide sufficient visibility for visuomotor policies. To address these issues, we introduce MoMaGen, which formulates data generation as a constrained optimization problem that enforces hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes prior approaches and provides a principled foundation for future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and show that it generates significantly more diverse datasets than existing methods. Leveraging this diversity, MoMaGen can train successful imitation learning policies from a single source demonstration, and these policies can be fine-tuned with as few as 40 real-world demonstrations to achieve deployment on physical robotic hardware. More details are available at our project page: momagen.github.io.
comment: Project website: momagen.github.io. The first four authors contribute equally
☆ Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task
We investigate how embedding dimension affects the emergence of an internal "world model" in a transformer trained with reinforcement learning to perform bubble-sort-style adjacent swaps. Models achieve high accuracy even with very small embedding dimensions, but larger dimensions yield more faithful, consistent, and robust internal representations. In particular, higher embedding dimensions strengthen the formation of structured internal representation and lead to better interpretability. After hundreds of experiments, we observe two consistent mechanisms: (1) the last row of the attention weight matrix monotonically encodes the global ordering of tokens; and (2) the selected transposition aligns with the largest adjacent difference of these encoded values. Our results provide quantitative evidence that transformers build structured internal world models and that model size improves representation quality in addition to end performance. We release our metrics and analyses, which can be used to probe similar algorithmic tasks.
☆ Towards Identifiability of Hierarchical Temporal Causal Representation Learning
Modeling hierarchical latent dynamics behind time series data is critical for capturing temporal dependencies across multiple levels of abstraction in real-world tasks. However, existing temporal causal representation learning methods fail to capture such dynamics, as they fail to recover the joint distribution of hierarchical latent variables from \textit{single-timestep observed variables}. Interestingly, we find that the joint distribution of hierarchical latent variables can be uniquely determined using three conditionally independent observations. Building on this insight, we propose a Causally Hierarchical Latent Dynamic (CHiLD) identification framework. Our approach first employs temporal contextual observed variables to identify the joint distribution of multi-layer latent variables. Sequentially, we exploit the natural sparsity of the hierarchical structure among latent variables to identify latent variables within each layer. Guided by the theoretical results, we develop a time series generative model grounded in variational inference. This model incorporates a contextual encoder to reconstruct multi-layer latent variables and normalize flow-based hierarchical prior networks to impose the independent noise condition of hierarchical latent dynamics. Empirical evaluations on both synthetic and real-world datasets validate our theoretical claims and demonstrate the effectiveness of CHiLD in modeling hierarchical latent dynamics.
☆ A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces
Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance analysis both computationally expensive and time-consuming. To address this challenge, we present an end-to-end parallel performance analysis framework designed to handle multiple large-scale GPU traces efficiently. Our proposed framework partitions and processes trace data concurrently and employs causal graph methods and parallel coordinating chart to expose performance variability and dependencies across execution flows. Experimental results demonstrate a 67% improvement in terms of scalability, highlighting the effectiveness of our pipeline for analyzing multiple traces independently.
☆ Physics-Informed Parametric Bandits for Beam Alignment in mmWave Communications
In millimeter wave (mmWave) communications, beam alignment and tracking are crucial to combat the significant path loss. As scanning the entire directional space is inefficient, designing an efficient and robust method to identify the optimal beam directions is essential. Since traditional bandit algorithms require a long time horizon to converge under large beam spaces, many existing works propose efficient bandit algorithms for beam alignment by relying on unimodality or multimodality assumptions on the reward function's structure. However, such assumptions often do not hold (or cannot be strictly satisfied) in practice, which causes such algorithms to converge to choosing suboptimal beams. In this work, we propose two physics-informed bandit algorithms \textit{pretc} and \textit{prgreedy} that exploit the sparse multipath property of mmWave channels - a generic but realistic assumption - which is connected to the Phase Retrieval Bandit problem. Our algorithms treat the parameters of each path as black boxes and maintain optimal estimates of them based on sampled historical rewards. \textit{pretc} starts with a random exploration phase and then commits to the optimal beam under the estimated reward function. \textit{prgreedy} performs such estimation in an online manner and chooses the best beam under current estimates. Our algorithms can also be easily adapted to beam tracking in the mobile setting. Through experiments using both the synthetic DeepMIMO dataset and the real-world DeepSense6G dataset, we demonstrate that both algorithms outperform existing approaches in a wide range of scenarios across diverse channel environments, showing their generalizability and robustness.
Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models
Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression etc., while preserving the identity of the face. Recent progress in 2D generative models have enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions. Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: https://vishal-vinod.github.io/gmpi-edit/
comment: 14 pages, 7 figures
☆ Online Time Series Forecasting with Theoretical Guarantees
This paper is concerned with online time series forecasting, where unknown distribution shifts occur over time, i.e., latent variables influence the mapping from historical to future observations. To develop an automated way of online time series forecasting, we propose a Theoretical framework for Online Time-series forecasting (TOT in short) with theoretical guarantees. Specifically, we prove that supplying a forecaster with latent variables tightens the Bayes risk, the benefit endures under estimation uncertainty of latent variables and grows as the latent variables achieve a more precise identifiability. To better introduce latent variables into online forecasting algorithms, we further propose to identify latent variables with minimal adjacent observations. Based on these results, we devise a model-agnostic blueprint by employing a temporal decoder to match the distribution of observed variables and two independent noise estimators to model the causal inference of latent variables and mixing procedures of observed variables, respectively. Experiment results on synthetic data support our theoretical claims. Moreover, plug-in implementations built on several baselines yield general improvement across multiple benchmarks, highlighting the effectiveness in real-world applications.
SPIKE: Stable Physics-Informed Kernel Evolution Method for Solving Hyperbolic Conservation Laws
We introduce the Stable Physics-Informed Kernel Evolution (SPIKE) method for numerical computation of inviscid hyperbolic conservation laws. SPIKE resolves a fundamental paradox: how strong-form residual minimization can capture weak solutions containing discontinuities. SPIKE employs reproducing kernel representations with regularized parameter evolution, where Tikhonov regularization provides a smooth transition mechanism through shock formation, allowing the dynamics to traverse shock singularities. This approach automatically maintains conservation, tracks characteristics, and captures shocks satisfying Rankine-Hugoniot conditions within a unified framework requiring no explicit shock detection or artificial viscosity. Numerical validation across scalar and vector-valued conservation laws confirms the method's effectiveness.
comment: 24 pages, 8 figures
☆ From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GPRO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early, identity preservation in the later. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
☆ Learning under Quantization for High-Dimensional Linear Regression
The use of low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, labels, parameters, activations, and gradients. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how different quantization affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum; and data and label quantization introduce additional approximation and quantized error. Crucially, we prove that for multiplicative quantization (with input-dependent quantization step), this spectral distortion can be eliminated, and for additive quantization (with constant quantization step), a beneficial scaling effect with batch size emerges. Furthermore, for common polynomial-decay data spectra, we quantitatively compare the risks of multiplicative and additive quantization, drawing a parallel to the comparison between FP and integer quantization methods. Our theory provides a powerful lens to characterize how quantization shapes the learning dynamics of optimization algorithms, paving the way to further explore learning theory under practical hardware constraints.
☆ NTKMTL: Mitigating Task Imbalance in Multi-Task Learning from Neural Tangent Kernel Perspective
Multi-Task Learning (MTL) enables a single model to learn multiple tasks simultaneously, leveraging knowledge transfer among tasks for enhanced generalization, and has been widely applied across various domains. However, task imbalance remains a major challenge in MTL. Although balancing the convergence speeds of different tasks is an effective approach to address this issue, it is highly challenging to accurately characterize the training dynamics and convergence speeds of multiple tasks within the complex MTL system. To this end, we attempt to analyze the training dynamics in MTL by leveraging Neural Tangent Kernel (NTK) theory and propose a new MTL method, NTKMTL. Specifically, we introduce an extended NTK matrix for MTL and adopt spectral analysis to balance the convergence speeds of multiple tasks, thereby mitigating task imbalance. Based on the approximation via shared representation, we further propose NTKMTL-SR, achieving training efficiency while maintaining competitive performance. Extensive experiments demonstrate that our methods achieve state-of-the-art performance across a wide range of benchmarks, including both multi-task supervised learning and multi-task reinforcement learning. Source code is available at https://github.com/jianke0604/NTKMTL.
☆ Illusions of reflection: open-ended task reveals systematic failures in Large Language Models' reflective reasoning
Humans do not just find mistakes after the fact -- we often catch them mid-stream because 'reflection' is tied to the goal and its constraints. Today's large language models produce reasoning tokens and 'reflective' text, but is it functionally equivalent with human reflective reasoning? Prior work on closed-ended tasks -- with clear, external 'correctness' signals -- can make 'reflection' look effective while masking limits in self-correction. We therefore test eight frontier models on a simple, real-world task that is open-ended yet rule-constrained, with auditable success criteria: to produce valid scientific test items, then revise after considering their own critique. First-pass performance is poor (often zero valid items out of 4 required; mean $\approx$ 1), and reflection yields only modest gains (also $\approx$ 1). Crucially, the second attempt frequently repeats the same violation of constraint, indicating 'corrective gains' arise largely from chance production of a valid item rather than error detection and principled, constraint-sensitive repair. Performance before and after reflection deteriorates as open-endedness increases, and models marketed for 'reasoning' show no advantage. Our results suggest that current LLM 'reflection' lacks functional evidence of the active, goal-driven monitoring that helps humans respect constraints even on a first pass. Until such mechanisms are instantiated in the model itself, reliable performance requires external structure that enforces constraints.
☆ Finding the Sweet Spot: Optimal Data Augmentation Ratio for Imbalanced Credit Scoring Using ADASYN
Credit scoring models face a critical challenge: severe class imbalance, with default rates typically below 10%, which hampers model learning and predictive performance. While synthetic data augmentation techniques such as SMOTE and ADASYN have been proposed to address this issue, the optimal augmentation ratio remains unclear, with practitioners often defaulting to full balancing (1:1 ratio) without empirical justification. This study systematically evaluates 10 data augmentation scenarios using the Give Me Some Credit dataset (97,243 observations, 7% default rate), comparing SMOTE, BorderlineSMOTE, and ADASYN at different multiplication factors (1x, 2x, 3x). All models were trained using XGBoost and evaluated on a held-out test set of 29,173 real observations. Statistical significance was assessed using bootstrap testing with 1,000 iterations. Key findings reveal that ADASYN with 1x multiplication (doubling the minority class) achieved optimal performance with AUC of 0.6778 and Gini coefficient of 0.3557, representing statistically significant improvements of +0.77% and +3.00% respectively (p = 0.017, bootstrap test). Higher multiplication factors (2x and 3x) resulted in performance degradation, with 3x showing a -0.48% decrease in AUC, suggesting a "law of diminishing returns" for synthetic oversampling. The optimal class imbalance ratio was found to be 6.6:1 (majority:minority), contradicting the common practice of balancing to 1:1. This work provides the first empirical evidence of an optimal "sweet spot" for data augmentation in credit scoring, with practical guidelines for industry practitioners and researchers working with imbalanced datasets. While demonstrated on a single representative dataset, the methodology provides a reproducible framework for determining optimal augmentation ratios in other imbalanced domains.
comment: 25 pages, 3 figures, 6 tables
☆ Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, hidden size, the allocation of parameters between MLP and attention (mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.
comment: 27 pages, 17 figures
☆ Learning with Dual-level Noisy Correspondence for Multi-modal Entity Alignment
Multi-modal entity alignment (MMEA) aims to identify equivalent entities across heterogeneous multi-modal knowledge graphs (MMKGs), where each entity is described by attributes from various modalities. Existing methods typically assume that both intra-entity and inter-graph correspondences are faultless, which is often violated in real-world MMKGs due to the reliance on expert annotations. In this paper, we reveal and study a highly practical yet under-explored problem in MMEA, termed Dual-level Noisy Correspondence (DNC). DNC refers to misalignments in both intra-entity (entity-attribute) and inter-graph (entity-entity and attribute-attribute) correspondences. To address the DNC problem, we propose a robust MMEA framework termed RULE. RULE first estimates the reliability of both intra-entity and inter-graph correspondences via a dedicated two-fold principle. Leveraging the estimated reliabilities, RULE mitigates the negative impact of intra-entity noise during attribute fusion and prevents overfitting to noisy inter-graph correspondences during inter-graph discrepancy elimination. Beyond the training-time designs, RULE further incorporates a correspondence reasoning module that uncovers the underlying attribute-attribute connection across graphs, guaranteeing more accurate equivalent entity identification. Extensive experiments on five benchmarks verify the effectiveness of our method against the DNC compared with seven state-of-the-art methods.The code is available at \href{https://github.com/XLearning-SCU/RULE}{XLearning-SCU/RULE}
comment: 30 pages, 12 figures
☆ LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling
Scaling large recommendation systems requires advancing three major frontiers: processing longer user histories, expanding candidate sets, and increasing model capacity. While promising, transformers' computational cost scales quadratically with the user sequence length and linearly with the number of candidates. This trade-off makes it prohibitively expensive to expand candidate sets or increase sequence length at inference, despite the significant performance improvements. We introduce \textbf{LIME}, a novel architecture that resolves this trade-off. Through two key innovations, LIME fundamentally reduces computational complexity. First, low-rank ``link embeddings" enable pre-computation of attention weights by decoupling user and candidate interactions, making the inference cost nearly independent of candidate set size. Second, a linear attention mechanism, \textbf{LIME-XOR}, reduces the complexity with respect to user sequence length from quadratic ($O(N^2)$) to linear ($O(N)$). Experiments on public and industrial datasets show LIME achieves near-parity with state-of-the-art transformers but with a 10$\times$ inference speedup on large candidate sets or long sequence lengths. When tested on a major recommendation platform, LIME improved user engagement while maintaining minimal inference costs with respect to candidate set size and user history length, establishing a new paradigm for efficient and expressive recommendation systems.
comment: 16 pages
☆ Fostering the Ecosystem of AI for Social Impact Requires Expanding and Strengthening Evaluation Standards NeurIPS 2025
There has been increasing research interest in AI/ML for social impact, and correspondingly more publication venues have refined review criteria for practice-driven AI/ML research. However, these review guidelines tend to most concretely recognize projects that simultaneously achieve deployment and novel ML methodological innovation. We argue that this introduces incentives for researchers that undermine the sustainability of a broader research ecosystem of social impact, which benefits from projects that make contributions on single front (applied or methodological) that may better meet project partner needs. Our position is that researchers and reviewers in machine learning for social impact must simultaneously adopt: 1) a more expansive conception of social impacts beyond deployment and 2) more rigorous evaluations of the impact of deployed systems.
comment: Accepted at NeurIPS 2025
☆ ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control
Generating high-quality synthetic text under differential privacy (DP) is critical for training and evaluating language models without compromising user privacy. Prior work on synthesizing DP datasets often fail to preserve key statistical attributes, suffer utility loss from the noise required by DP, and lack fine-grained control over generation. To address these challenges, we make two contributions. First, we introduce a hierarchical framework that decomposes DP synthetic text generation into two subtasks: feature learning and conditional text generation. This design explicitly incorporates learned features into the generation process and simplifies the end-to-end synthesis task. Through systematic ablations, we identify the most effective configuration: a rich tabular schema as feature, a DP tabular synthesizer, and a DP fine-tuned conditional generator, which we term ACTG (Attribute-Conditioned Text Generation). Second, we propose Anchored RL (ARL), a post-training method that improves the instruction-following ability of ACTG for conditional generation. ARL combines RL to boost control with an SFT anchor on best-of-$N$ data to prevent reward hacking. Together, these components form our end-to-end algorithm ACTG-ARL, which advances both the quality of DP synthetic text (+20% MAUVE over prior work) and the control of the conditional generator under strong privacy guarantees.
☆ Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations
Fine-tuning large language models (LLMs) using zeroth-order (ZO) optimization has emerged as a promising alternative to traditional gradient-based methods due to its reduced memory footprint requirement. However, existing ZO methods suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models. In this work, we propose P-GAP, a fast LLM fine-tuning approach through zeroth-order optimization with Projected Gradient-Aligned Perturbations. Specifically, we first estimate a low-dimensional gradient space and then align perturbations in projected gradients' direction within the space. This approach enables reduced the number of perturbed parameters and decreased variance, therefore accelerated convergence for LLM fine-tuning. Experiments on LLMs show that P-GAP consistently surpasses the baselines, achieving up to 6% increase in accuracy on classification tasks and up to 12% higher accuracy on generation tasks, with up to about 81% less training iterations and 70% less GPU hours. These results demonstrate that P-GAP enables fast, scalable, and resource-efficient ZO LLM fine-tuning.
comment: 10 pages, 5 figures
☆ Joint Optimization of Cooperation Efficiency and Communication Covertness for Target Detection with AUVs
This paper investigates underwater cooperative target detection using autonomous underwater vehicles (AUVs), with a focus on the critical trade-off between cooperation efficiency and communication covertness. To tackle this challenge, we first formulate a joint trajectory and power control optimization problem, and then present an innovative hierarchical action management framework to solve it. According to the hierarchical formulation, at the macro level, the master AUV models the agent selection process as a Markov decision process and deploys the proximal policy optimization algorithm for strategic task allocation. At the micro level, each selected agent's decentralized decision-making is modeled as a partially observable Markov decision process, and a multi-agent proximal policy optimization algorithm is used to dynamically adjust its trajectory and transmission power based on its local observations. Under the centralized training and decentralized execution paradigm, our target detection framework enables adaptive covert cooperation while satisfying both energy and mobility constraints. By comprehensively modeling the considered system, the involved signals and tasks, as well as energy consumption, theoretical insights and practical solutions for the efficient and secure operation of multiple AUVs are provided, offering significant implications for the execution of underwater covert communication tasks.
☆ The Bias-Variance Tradeoff in Data-Driven Optimization: A Local Misspecification Perspective
Data-driven stochastic optimization is ubiquitous in machine learning and operational decision-making problems. Sample average approximation (SAA) and model-based approaches such as estimate-then-optimize (ETO) or integrated estimation-optimization (IEO) are all popular, with model-based approaches being able to circumvent some of the issues with SAA in complex context-dependent problems. Yet the relative performance of these methods is poorly understood, with most results confined to the dichotomous cases of the model-based approach being either well-specified or misspecified. We develop the first results that allow for a more granular analysis of the relative performance of these methods under a local misspecification setting, which models the scenario where the model-based approach is nearly well-specified. By leveraging tools from contiguity theory in statistics, we show that there is a bias-variance tradeoff between SAA, IEO, and ETO under local misspecification, and that the relative importance of the bias and the variance depends on the degree of local misspecification. Moreover, we derive explicit expressions for the decision bias, which allows us to characterize (un)impactful misspecification directions, and provide further geometric understanding of the variance.
☆ VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
comment: 10 pages, 5 figures, 4 tables. Under review
A Definition of AGI
The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains-including reasoning, memory, and perception-and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly "jagged" cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 58%) concretely quantify both rapid progress and the substantial gap remaining before AGI.
☆ RESCUE: Retrieval Augmented Secure Code Generation
Despite recent advances, Large Language Models (LLMs) still generate vulnerable code. Retrieval-Augmented Generation (RAG) has the potential to enhance LLMs for secure code generation by incorporating external security knowledge. However, the conventional RAG design struggles with the noise of raw security-related documents, and existing retrieval methods overlook the significant security semantics implicitly embedded in task descriptions. To address these issues, we propose RESCUE, a new RAG framework for secure code generation with two key innovations. First, we propose a hybrid knowledge base construction method that combines LLM-assisted cluster-then-summarize distillation with program slicing, producing both high-level security guidelines and concise, security-focused code examples. Second, we design a hierarchical multi-faceted retrieval to traverse the constructed knowledge base from top to bottom and integrates multiple security-critical facts at each hierarchical level, ensuring comprehensive and accurate retrieval. We evaluated RESCUE on four benchmarks and compared it with five state-of-the-art secure code generation methods on six LLMs. The results demonstrate that RESCUE improves the SecurePass@1 metric by an average of 4.8 points, establishing a new state-of-the-art performance for security. Furthermore, we performed in-depth analysis and ablation studies to rigorously validate the effectiveness of individual components in RESCUE.
☆ Ensemble based Closed-Loop Optimal Control using Physics-Informed Neural Networks
The objective of designing a control system is to steer a dynamical system with a control signal, guiding it to exhibit the desired behavior. The Hamilton-Jacobi-Bellman (HJB) partial differential equation offers a framework for optimal control system design. However, numerical solutions to this equation are computationally intensive, and analytical solutions are frequently unavailable. Knowledge-guided machine learning methodologies, such as physics-informed neural networks (PINNs), offer new alternative approaches that can alleviate the difficulties of solving the HJB equation numerically. This work presents a multistage ensemble framework to learn the optimal cost-to-go, and subsequently the corresponding optimal control signal, through the HJB equation. Prior PINN-based approaches rely on a stabilizing the HJB enforcement during training. Our framework does not use stabilizer terms and offers a means of controlling the nonlinear system, via either a singular learned control signal or an ensemble control signal policy. Success is demonstrated in closed-loop control, using both ensemble- and singular-control, of a steady-state time-invariant two-state continuous nonlinear system with an infinite time horizon, accounting of noisy, perturbed system states and varying initial conditions.
☆ FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo
Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents \emph{FST.ai 2.0}, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates {pose-based action recognition} using graph convolutional networks (GCNs), {epistemic uncertainty modeling} through credal sets, and {explainability overlays} for visual decision support. A set of {interactive dashboards} enables human--AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, FST.ai~2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an {85\% reduction in decision review time} and {93\% referee trust} in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, FST.ai~2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.
comment: 23 pages, 12 figures
☆ Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network
Estimating piano dynamic from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation. These four targets form the metrical structure of dynamics in the music score. Inspired by recent vocal dynamic research, we use a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to log-Mel as input, this reduces model size from 14.7 M to 0.5 M, enabling long sequential input. We use a 60-second audio length in audio segmentation, which doubled the length of beat tracking commonly used. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamic estimation and delivers a powerful and compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.
comment: Paper submitted to ICASSP2026
☆ ActivationReasoning: Logical Reasoning in Latent Activation Spaces
Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations, first latent concept representations are identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions, at inference time AR detects activating concepts and maps them to logical propositions; and (3)Logical reasoning, applying logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.
☆ Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria
Finding Nash equilibria in imperfect-information games remains a central challenge in multi-agent reinforcement learning. While regularization-based methods have recently achieved last-iteration convergence to a regularized equilibrium, they require the regularization strength to shrink toward zero to approximate a Nash equilibrium, often leading to unstable learning in practice. Instead, we fix the regularization strength at a large value for robustness and achieve convergence by iteratively refining the reference policy. Our main theoretical result shows that this procedure guarantees strictly monotonic improvement and convergence to an exact Nash equilibrium in two-player zero-sum games, without requiring a uniqueness assumption. Building on this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which preserves the generalizability of policy gradient methods while relying solely on the current and reference policies. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold'em, where NashPG consistently attains higher Elo ratings.
☆ A Multi-Evidence Framework Rescues Low-Power Prognostic Signals and Rejects Statistical Artifacts in Cancer Genomics
Motivation: Standard genome-wide association studies in cancer genomics rely on statistical significance with multiple testing correction, but systematically fail in underpowered cohorts. In TCGA breast cancer (n=967, 133 deaths), low event rates (13.8%) create severe power limitations, producing false negatives for known drivers and false positives for large passenger genes. Results: We developed a five-criteria computational framework integrating causal inference (inverse probability weighting, doubly robust estimation) with orthogonal biological validation (expression, mutation patterns, literature evidence). Applied to TCGA-BRCA mortality analysis, standard Cox+FDR detected zero genes at FDR<0.05, confirming complete failure in underpowered settings. Our framework correctly identified RYR2 -- a cardiac gene with no cancer function -- as a false positive despite nominal significance (p=0.024), while identifying KMT2C as a complex candidate requiring validation despite marginal significance (p=0.047, q=0.954). Power analysis revealed median power of 15.1% across genes, with KMT2C achieving only 29.8% power (HR=1.55), explaining borderline statistical significance despite strong biological evidence. The framework distinguished true signals from artifacts through mutation pattern analysis: RYR2 showed 29.8% silent mutations (passenger signature) with no hotspots, while KMT2C showed 6.7% silent mutations with 31.4% truncating variants (driver signature). This multi-evidence approach provides a template for analyzing underpowered cohorts, prioritizing biological interpretability over purely statistical significance. Availability: All code and analysis pipelines available at github.com/akarlaraytu/causal-inference-for-cancer-genomics
comment: 17 pages (main text), 4 figures (main text), 7 supplementary figures, 4 supplementary tables. Focuses on a computational framework using causal inference and biological validation for underpowered cancer genomic studies
☆ PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion
State-of-the-art perceptive Reinforcement Learning controllers for legged robots either (i) impose oscillator or IK-based gait priors that constrain the action space, add bias to the policy optimization and reduce adaptability across robot morphologies, or (ii) operate "blind", which struggle to anticipate hind-leg terrain, and are brittle to noise. In this paper, we propose Phase-Guided Terrain Traversal (PGTT), a perception-aware deep-RL approach that overcomes these limitations by enforcing gait structure purely through reward shaping, thereby reducing inductive bias in policy learning compared to oscillator/IK-conditioned action priors. PGTT encodes per-leg phase as a cubic Hermite spline that adapts swing height to local heightmap statistics and adds a swing-phase contact penalty, while the policy acts directly in joint space supporting morphology-agnostic deployment. Trained in MuJoCo (MJX) on procedurally generated stair-like terrains with curriculum and domain randomization, PGTT achieves the highest success under push disturbances (median +7.5% vs. the next best method) and on discrete obstacles (+9%), with comparable velocity tracking, and converging to an effective policy roughly 2x faster than strong end-to-end baselines. We validate PGTT on a Unitree Go2 using a real-time LiDAR elevation-to-heightmap pipeline, and we report preliminary results on ANYmal-C obtained with the same hyperparameters. These findings indicate that terrain-adaptive, phase-guided reward shaping is a simple and general mechanism for robust perceptive locomotion across platforms.
comment: 9 pages, 9 figures, 2 tables
☆ ECG-LLM -- training and evaluation of domain-specific large language models for electrocardiography
Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.
comment: 34 pages, 8 figures, code available at https://github.com/AI4HealthUOL/ecg-llm
♻ ☆ Glyph: Scaling Context Windows via Visual-Text Compression
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
♻ ☆ Denoising the Future: Top-p Distributions for Moving Through Time
Inference in dynamic probabilistic models is a complex task involving expensive operations. In particular, for Hidden Markov Models, the whole state space has to be enumerated for advancing in time. Even states with negligible probabilities are considered, resulting in computational inefficiency and increased noise due to the propagation of unlikely probability mass. We propose to denoise the future and speed up inference by using only the top-p states, i.e., the most probable states with accumulated probability p. We show that the error introduced by using only the top-p states is bound by p and the so-called minimal mixing rate of the underlying model. Moreover, in our empirical evaluation, we show that we can expect speedups of at least an order of magnitude, while the error in terms of total variation distance is below 0.09.
comment: Accepted at ECSQARU 2025
♻ ☆ Enabling Fine-Grained Operating Points for Black-Box LLMs ICLR 2026
Black-box Large Language Models (LLMs) provide practical and accessible alternatives to other machine learning methods, as they require minimal labeled data and machine learning expertise to develop solutions for various decision making problems. However, for applications that need operating with constraints on specific metrics (e.g., precision $\geq$ 95%), decision making with black-box LLMs remains unfavorable, due to their low numerical output cardinalities. This results in limited control over their operating points, preventing fine-grained adjustment of their decision making behavior. In this paper, we study using black-box LLMs as classifiers, focusing on efficiently improving their operational granularity without performance loss. Specifically, we first investigate the reasons behind their low-cardinality numerical outputs and show that they are biased towards generating rounded but informative verbalized probabilities. Then, we experiment with standard prompt engineering, uncertainty estimation and confidence elicitation techniques, and observe that they do not effectively improve operational granularity without sacrificing performance or increasing inference cost. Finally, we propose efficient approaches to significantly increase the number and diversity of available operating points. Our proposed approaches provide finer-grained operating points and achieve comparable to or better performance than the benchmark methods across 11 datasets and 3 LLMs.
comment: Under review at ICLR 2026. 36 pages, 17 figures
♻ ☆ TrajMamba: An Efficient and Semantic-rich Vehicle Trajectory Pre-training Model NeurIPS2025
Vehicle GPS trajectories record how vehicles move over time, storing valuable travel semantics, including movement patterns and travel purposes. Learning travel semantics effectively and efficiently is crucial for real-world applications of trajectory data, which is hindered by two major challenges. First, travel purposes are tied to the functions of the roads and points-of-interest (POIs) involved in a trip. Such information is encoded in textual addresses and descriptions and introduces heavy computational burden to modeling. Second, real-world trajectories often contain redundant points, which harm both computational efficiency and trajectory embedding quality. To address these challenges, we propose TrajMamba, a novel approach for efficient and semantically rich vehicle trajectory learning. TrajMamba introduces a Traj-Mamba Encoder that captures movement patterns by jointly modeling both GPS and road perspectives of trajectories, enabling robust representations of continuous travel behaviors. It also incorporates a Travel Purpose-aware Pre-training procedure to integrate travel purposes into the learned embeddings without introducing extra overhead to embedding calculation. To reduce redundancy in trajectories, TrajMamba features a Knowledge Distillation Pre-training scheme to identify key trajectory points through a learnable mask generator and obtain effective compressed trajectory embeddings. Extensive experiments on two real-world datasets and three downstream tasks show that TrajMamba outperforms state-of-the-art baselines in both efficiency and accuracy.
comment: Accepted by NeurIPS2025
♻ ☆ FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization
The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss introduced. This framework offers a flexible and adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
comment: 10 pages, 7 figures, 2 tables
♻ ☆ SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
comment: Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench
♻ ☆ Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
comment: 16 pages, 12 figures
♻ ☆ Language Models are Injective and Hence Invertible
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
♻ ☆ MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature
Synthesis procedures play a critical role in materials research, as they directly affect material properties. With data-driven approaches increasingly accelerating materials discovery, there is growing interest in extracting synthesis procedures from scientific literature as structured data. However, existing studies often rely on rigid, domain-specific schemas with predefined fields for structuring synthesis procedures or assume that synthesis procedures are linear sequences of operations, which limits their ability to capture the structural complexity of real-world procedures. To address these limitations, we adopt PROV-DM, an international standard for provenance information, which supports flexible, graph-based modeling of procedures. We present MatPROV, a dataset of PROV-DM-compliant synthesis procedures extracted from scientific literature using large language models. MatPROV captures structural complexities and causal relationships among materials, operations, and conditions through visually intuitive directed graphs. This representation enables machine-interpretable synthesis knowledge, opening opportunities for future research such as automated synthesis planning and optimization.
♻ ☆ Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability
Latent component identification from unknown nonlinear mixtures is a foundational challenge in machine learning, with applications in tasks such as disentangled representation learning and causal inference. Prior work in nonlinear independent component analysis (nICA) has shown that auxiliary signals -- such as weak supervision -- can support identifiability of conditionally independent latent components. More recent approaches explore structural assumptions, e.g., sparsity in the Jacobian of the mixing function, to relax such requirements. In this work, we introduce Diverse Influence Component Analysis (DICA), a framework that exploits the convex geometry of the mixing function's Jacobian. We propose a Jacobian Volume Maximization (J-VolMax) criterion, which enables latent component identification by encouraging diversity in their influence on the observed variables. Under reasonable conditions, this approach achieves identifiability without relying on auxiliary information, latent component independence, or Jacobian sparsity assumptions. These results extend the scope of identifiability analysis and offer a complementary perspective to existing methods.
comment: 30 pages, 3 figures
♻ ☆ Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration
Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encode latent task dynamics, are often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.
♻ ☆ High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services
The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low-fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that includes columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that lead to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of an LLM (\textit{gemini}) based test data generation for industrial SQL code generation services where generating high-fidelity test data is essential due to the frequent unavailability and inaccessibility of production datasets for testing.
♻ ☆ One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit's update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
comment: Journal extension of v1: Y. Min, K, Ahn, N. Azizan, "One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares," IEEE Conference on Decision and Control, 2022
♻ ☆ How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks -- which requires to recognize the \emph{positional} association between a pair of tokens from in-context examples. Existing theoretical results only focus on the in-context reasoning behavior of transformers after being trained for the \emph{one} gradient descent step. It remains unclear what is the on-convergence behavior of transformers being trained by gradient descent and how fast the convergence rate is. In addition, the generalization of transformers in one-step in-context reasoning has not been formally investigated. This work addresses these gaps. We first show that a class of transformers with either linear, ReLU or softmax attentions, is provably Bayes-optimal for an in-context recall task. When being trained with gradient descent, we show via a finite-sample analysis that the expected loss converges at linear rate to the Bayes risks. Moreover, we show that the trained transformers exhibit out-of-distribution (OOD) generalization, i.e., generalizing to samples outside of the population distribution. Our theoretical findings are further supported by extensive empirical validations, showing that \emph{without} proper parameterization, models with larger expressive power surprisingly \emph{fail} to generalize OOD after being trained by gradient descent.
comment: V3: added new results for softmax attention, typos fixed, titled changed. 33 pages
♻ ☆ Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks
Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point(FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present NAO: a Nondeterministic tolerance Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. NAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement NAO as a PyTorch-compatible runtime and a contract layer currently deployed on Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers and diffusion models on A100, H100, RTX6000, RTX4090, empirical thresholds are $10^2-10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. NAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
comment: 17 pages, 7 figures
♻ ☆ Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding-particularly in the context of structured geometric data-remains unexplored. This paper initiates a theoretical study of ICL for regression of H\"older functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through its interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for H\"older functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of H\"older functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novels tools to study ICL of nonlinear models.
♻ ☆ TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs
With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as low as 1.58~bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected long latency of the prefill stage. We present \textbf{TeLLMe}, the first table-lookup-based ternary LLM accelerator for low-power edge FPGAs that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. TeLLMe incorporates several novel techniques, including (1) a table-lookup-based ternary matrix multiplication (TLMM) engine utilizing grouped activations and online precomputation for low resource utilization and high throughput; (2) a fine-grained analytic URAM-based weight buffer management scheme for efficient loading and compute engine access; (3) a streaming dataflow architecture that fuses floating-point element-wise operations with linear computations to hide latency; (4) a reversed-reordered prefill stage attention with fused attention operations for high memory efficiency; and (5) a resource-efficient specialized decoding stage attention. Under a 5~W power budget, TeLLMe delivers up to 25~tokens/s decoding throughput and 0.45--0.96~s time-to-first-token (TTFT) for 64--128 token prompts, marking a significant energy-efficiency advancement in LLM inference on edge FPGAs.
♻ ☆ Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
♻ ☆ Stochastic Path Planning in Correlated Obstacle Fields
We introduce the Stochastic Correlated Obstacle Scene (SCOS) problem, a navigation setting with spatially correlated obstacles of uncertain blockage status, realistically constrained sensors that provide noisy readings and costly disambiguation. Modeling the spatial correlation with Gaussian Random Field (GRF), we develop Bayesian belief updates that refine blockage probabilities, and use the posteriors to reduce search space for efficiency. To find the optimal traversal policy, we propose a novel two-stage learning framework. An offline phase learns a robust base policy via optimistic policy iteration augmented with information bonus to encourage exploration in informative regions, followed by an online rollout policy with periodic base updates via a Bayesian mechanism for information adaptation. This framework supports both Monte Carlo point estimation and distributional reinforcement learning (RL) to learn full cost distributions, leading to stronger uncertainty quantification. We establish theoretical benefits of correlation-aware updating and convergence property under posterior sampling. Comprehensive empirical evaluations across varying obstacle densities, sensor capabilities demonstrate consistent performance gains over baselines. This framework addresses navigation challenges in environments with adversarial interruptions or clustered natural hazards.
♻ ☆ Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^π$-Realizable MDPs
We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (\SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of \SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.
♻ ☆ Viability of perturbative expansion for quantum field theories on neurons
Neural Network (NN) architectures that break statistical independence of parameters have been proposed as a new approach for simulating local quantum field theories (QFTs). In the infinite neuron number limit, single-layer NNs can exactly reproduce QFT results. This paper examines the viability of this architecture for perturbative calculations of local QFTs for finite neuron number $N$ using scalar $\phi^4$ theory in $d$ Euclidean dimensions as an example. We find that the renormalized $O(1/N)$ corrections to two- and four-point correlators yield perturbative series which are sensitive to the ultraviolet cut-off and therefore have a weak convergence. We propose a modification to the architecture to improve this convergence and discuss constraints on the parameters of the theory and the scaling of N which allow us to extract accurate field theory results.
comment: Updated references and minor changes to the text
♻ ☆ Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures
Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
comment: Waiting for Conference Response
♻ ☆ In-Context Learning of Stochastic Differential Equations with Foundation Inference Models NeurIPS 2025
Stochastic differential equations (SDEs) describe dynamical systems where deterministic flows, governed by a drift function, are superimposed with random fluctuations, dictated by a diffusion function. The accurate estimation (or discovery) of these functions from data is a central problem in machine learning, with wide application across the natural and social sciences. Yet current solutions either rely heavily on prior knowledge of the dynamics or involve intricate training procedures. We introduce FIM-SDE (Foundation Inference Model for SDEs), a pretrained recognition model that delivers accurate in-context (or zero-shot) estimation of the drift and diffusion functions of low-dimensional SDEs, from noisy time series data, and allows rapid finetuning to target datasets. Leveraging concepts from amortized inference and neural operators, we (pre)train FIM-SDE in a supervised fashion to map a large set of noisy, discretely observed SDE paths onto the space of drift and diffusion functions. We demonstrate that FIM-SDE achieves robust in-context function estimation across a wide range of synthetic and real-world processes -- from canonical SDE systems (e.g., double-well dynamics or weakly perturbed Lorenz attractors) to stock price recordings and oil-price and wind-speed fluctuations -- while matching the performance of symbolic, Gaussian process and Neural SDE baselines trained on the target datasets. When finetuned to the target processes, we show that FIM-SDE consistently outperforms all these baselines.
comment: Accepted at NeurIPS 2025. The previous version appeared under the title "Foundation Inference Models for Stochastic Differential Equations: A Transformer-based Approach for Zero-shot Function Estimation."
♻ ☆ Counterfactual reasoning: an analysis of in-context emergence NeurIPS
Large-scale neural language models exhibit remarkable performance in in-context learning: the ability to learn and reason about the input context on the fly. This work studies in-context counterfactual reasoning in language models, that is, the ability to predict consequences of a hypothetical scenario. We focus on a well-defined, synthetic linear regression task that requires noise abduction. Accurate prediction is based on (1) inferring an unobserved latent concept and (2) copying contextual noise from factual observations. We show that language models are capable of counterfactual reasoning. Further, we enhance existing identifiability results and reduce counterfactual reasoning for a broad class of functions to a transformation on in-context observations. In Transformers, we find that self-attention, model depth and pre-training data diversity drive performance. Moreover, we provide mechanistic evidence that the latent concept is linearly represented in the residual stream and we introduce designated \textit{noise abduction heads} central to performing counterfactual reasoning. Lastly, our findings extend to counterfactual reasoning under SDE dynamics and reflect that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/mrtzmllr/iccr.
comment: Published as a conference paper at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
♻ ☆ Interpretable Decision-Making for End-to-End Autonomous Driving ICCV 2025
Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.
comment: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of Out-of-Label Hazards in Autonomous Driving (2COOOL)
♻ ☆ Who cuts emissions, who turns up the heat? causal machine learning estimates of energy efficiency interventions
Reducing domestic energy demand is central to climate mitigation and fuel poverty strategies, yet the impact of energy efficiency interventions is highly heterogeneous. Using a causal machine learning model trained on nationally representative data of the English housing stock, we estimate average and conditional treatment effects of wall insulation on gas consumption, focusing on distributional effects across energy burden subgroups. While interventions reduce gas demand on average (by as much as 19 percent), low energy burden groups achieve substantial savings, whereas those experiencing high energy burdens see little to no reduction. This pattern reflects a behaviourally-driven mechanism: households constrained by high costs-to-income ratios (e.g. more than 0.1) reallocate savings toward improved thermal comfort rather than lowering consumption. Far from wasteful, such responses represent rational adjustments in contexts of prior deprivation, with potential co-benefits for health and well-being. These findings call for a broader evaluation framework that accounts for both climate impacts and the equity implications of domestic energy policy.
♻ ☆ Inductive Domain Transfer In Misspecified Simulation-Based Inference
Simulation-based inference (SBI) is a statistical inference approach for estimating latent parameters of a physical system when the likelihood is intractable but simulations are available. In practice, SBI is often hindered by model misspecification--the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, a recent SBI approach, addresses this challenge through a two-stage domain transfer process that combines semi-supervised calibration with optimal transport (OT)-based distribution alignment. However, RoPE operates in a fully transductive setting, requiring access to a batch of test samples at inference time, which limits scalability and generalization. We propose here a fully inductive and amortized SBI framework that integrates calibration and distributional alignment into a single, end-to-end trainable model. Our method leverages mini-batch OT with a closed-form coupling to align real and simulated observations that correspond to the same latent parameters, using both paired calibration data and unpaired samples. A conditional normalizing flow is then trained to approximate the OT-induced posterior, enabling efficient inference without simulation access at test time. Across a range of synthetic and real-world benchmarks--including complex medical biomarker estimation--our approach matches or surpasses the performance of RoPE, as well as other standard SBI and non-SBI estimators, while offering improved scalability and applicability in challenging, misspecified environments.
♻ ☆ A unified framework for establishing the universal approximation of transformer-type architectures NeurIPS 2025
We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.
comment: NeurIPS 2025 camera-ready
♻ ☆ Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
This paper provides a self-contained, from-scratch, exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
comment: 35 pages, 1 figure
♻ ☆ Dynamic object goal pushing with mobile manipulators through model-free constrained reinforcement learning ICRA 2025
Non-prehensile pushing to move and reorient objects to a goal is a versatile loco-manipulation skill. In the real world, the object's physical properties and friction with the floor contain significant uncertainties, which makes the task challenging for a mobile manipulator. In this paper, we develop a learning-based controller for a mobile manipulator to move an unknown object to a desired position and yaw orientation through a sequence of pushing actions. The proposed controller for the robotic arm and the mobile base motion is trained using a constrained Reinforcement Learning (RL) formulation. We demonstrate its capability in experiments with a quadrupedal robot equipped with an arm. The learned policy achieves a success rate of 91.35% in simulation and at least 80% on hardware in challenging scenarios. Through our extensive hardware experiments, we show that the approach demonstrates high robustness against unknown objects of different masses, materials, sizes, and shapes. It reactively discovers the pushing location and direction, thus achieving contact-rich behavior while observing only the pose of the object. Additionally, we demonstrate the adaptive behavior of the learned policy towards preventing the object from toppling.
comment: presented at ICRA 2025, Video: https://youtu.be/wGAdPGVf9Ws?si=pi83ONWofHHqbFG0
♻ ☆ FedMeld: A Model-dispersal Federated Learning Framework for Space-ground Integrated Networks
To bridge the digital divide, the space-ground integrated networks (SGINs), which will be a key component of the six-generation (6G) mobile networks, are expected to deliver artificial intelligence (AI) services to every corner of the world. One mission of SGINs is to support federated learning (FL) at a global scale. However, existing space-ground integrated FL frameworks involve ground stations or costly inter-satellite links, entailing excessive training latency and communication costs. To overcome these limitations, we propose an infrastructure-free federated learning framework based on a model dispersal (FedMeld) strategy, which exploits periodic movement patterns and store-carry-forward capabilities of satellites to enable parameter mixing across large-scale geographical regions. We theoretically show that FedMeld leads to global model convergence and quantify the effects of round interval and mixing ratio between adjacent areas on its learning performance. Based on the theoretical results, we formulate a joint optimization problem to design the staleness control and mixing ratio (SC-MR) for minimizing the training loss. By decomposing the problem into sequential SC and MR subproblems without compromising the optimality, we derive the round interval solution in a closed form and the mixing ratio in a semi-closed form to achieve the optimal latency-accuracy tradeoff. Experiments using various datasets demonstrate that FedMeld achieves superior model accuracy while significantly reducing communication costs as compared with traditional FL schemes for SGINs.
comment: 16 pages, 10 figures. This work has been submitted to the IEEE for possible publication
♻ ☆ A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI
Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream tasks, such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of the classical sufficient statistics, and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. We further propose the Joint Generative Hierarchical Model for the joint distribution of images and text, showing that transformers can efficiently approximate relevant functions within this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multi-modal learning based on contrastive pre-trained representations. Numerical simulations validate these theoretical findings, demonstrating the strong generalization performance of contrastively pre-trained transformers in various multi-modal tasks.
♻ ☆ Review of Explainable Graph-Based Recommender Systems
Explainability of recommender systems has become essential to ensure users' trust and satisfaction. Various types of explainable recommender systems have been proposed including explainable graph-based recommender systems. This review paper discusses state-of-the-art approaches of these systems and categorizes them based on three aspects: learning methods, explaining methods, and explanation types. It also explores the commonly used datasets, explainability evaluation methods, and future directions of this research area. Compared with the existing review papers, this paper focuses on explainability based on graphs and covers the topics required for developing novel explainable graph-based recommender systems.
♻ ☆ From Reviews to Actionable Insights: An LLM-Based Approach for Attribute and Feature Extraction
This research proposes a systematic, large language model (LLM) approach for extracting product and service attributes, features, and associated sentiments from customer reviews. Grounded in marketing theory, the framework distinguishes perceptual attributes from actionable features, producing interpretable and managerially actionable insights. We apply the methodology to 20,000 Yelp reviews of Starbucks stores and evaluate eight prompt variants on a random subset of reviews. Model performance is assessed through agreement with human annotations and predictive validity for customer ratings. Results show high consistency between LLMs and human coders and strong predictive validity, confirming the reliability of the approach. Human coders required a median of six minutes per review, whereas the LLM processed each in two seconds, delivering comparable insights at a scale unattainable through manual coding. Managerially, the analysis identifies attributes and features that most strongly influence customer satisfaction and their associated sentiments, enabling firms to pinpoint "joy points," address "pain points," and design targeted interventions. We demonstrate how structured review data can power an actionable marketing dashboard that tracks sentiment over time and across stores, benchmarks performance, and highlights high-leverage features for improvement. Simulations indicate that enhancing sentiment for key service features could yield 1-2% average revenue gains per store.
♻ ☆ Backward Conformal Prediction
We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat C_n^{\tilde{\alpha}}(X_{\rm test})) \ge 1 - \mathbb{E}[\tilde{\alpha}]$ up to a first-order Taylor approximation for any data-dependent miscoverage $\tilde{\alpha}$, and (ii) a novel leave-one-out estimator $\hat{\alpha}^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tilde{\alpha}]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.
comment: Code available at: https://github.com/GauthierE/backward-cp
♻ ☆ Spike-timing-dependent Hebbian learning as noisy gradient descent
Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.
♻ ☆ Trial and Trust: Addressing Byzantine Attacks with Comprehensive Defense Strategy
Recent advancements in machine learning have improved performance while also increasing computational demands. While federated and distributed setups address these issues, their structure is vulnerable to malicious influences. In this paper, we address a specific threat, Byzantine attacks, where compromised clients inject adversarial updates to derail global convergence. We combine the trust scores concept with trial function methodology to dynamically filter outliers. Our methods address the critical limitations of previous approaches, allowing functionality even when Byzantine nodes are in the majority. Moreover, our algorithms adapt to widely used scaled methods like Adam and RMSProp, as well as practical scenarios, including local training and partial participation. We validate the robustness of our methods by conducting extensive experiments on both synthetic and real ECG data collected from medical institutions. Furthermore, we provide a broad theoretical analysis of our algorithms and their extensions to aforementioned practical setups. The convergence guarantees of our methods are comparable to those of classical algorithms developed without Byzantine interference.
♻ ☆ Generation of Uncertainty-Aware Emergent Concepts in Factorized 3D Scene Graphs via Graph Neural Networks
Enabling robots to autonomously discover emergent spatial concepts (e.g., rooms) from primitive geometric observations (e.g., planar surfaces) within 3D Scene Graphs is essential for robust indoor navigation and mapping. These graphs provide a hierarchical metric-semantic representation in which such concepts are organized. To further enhance graph-SLAM performance, Factorized 3D Scene Graphs incorporate these concepts as optimization factors that constrain relative geometry and enforce global consistency. However, both stages of this process remain largely manual: concepts are typically derived using hand-crafted, concept-specific heuristics, while factors and their covariances are likewise manually designed. This reliance on manual specification limits generalization across diverse environments and scalability to new concept classes. This paper presents, for the first time, a learning-based method to generate online spatial emergent concepts as optimizable factors within a SLAM backend, reducing the need to handcraft both concept generation and the definition of their corresponding factors and covariances. In both simulated and real indoor scenarios, our approach improves complex concept detection by 20.7% and 5.3%, trajectory estimation by 19.2%, and map reconstruction by 12.3% and 3.8%, respectively, highlighting the benefits of this integration for robust and adaptive spatial understanding.
comment: Submitted to IEEE Robotics and Automation Letters (RA-L)
♻ ☆ The $\varphi$ Curve: The Shape of Generalization through the Lens of Norm-based Capacity Control NeurIPS'25
Understanding how the test risk scales with model complexity is a central question in machine learning. Classical theory is challenged by the learning curves observed for large over-parametrized deep networks. Capacity measures based on parameter count typically fail to account for these empirical observations. To tackle this challenge, we consider norm-based capacity measures and develop our study for random features based estimators, widely used as simplified theoretical models for more complex networks. In this context, we provide a precise characterization of how the estimator's norm concentrates and how it governs the associated test error. Our results show that the predicted learning curve admits a phase transition from under- to over-parameterization, but no double descent behavior. This confirms that more classical U-shaped behavior is recovered considering appropriate capacity measures based on models norms rather than size. From a technical point of view, we leverage deterministic equivalence as the key tool and further develop new deterministic quantities which are of independent interest.
comment: Accepted by NeurIPS'25
♻ ☆ A Physics-Informed Spatiotemporal Deep Learning Framework for Turbulent Systems
Fluid thermodynamics underpins atmospheric dynamics, climate science, industrial applications, and energy systems. However, direct numerical simulations (DNS) of such systems can be computationally prohibitive. To address this, we present a novel physics-informed spatiotemporal surrogate model for Rayleigh-B\'enard convection (RBC), a canonical example of convective fluid flow. Our approach combines convolutional neural networks, for spatial dimension reduction, with an innovative recurrent architecture, inspired by large language models, to model long-range temporal dynamics. Inference is penalized with respect to the governing partial differential equations to ensure physical interpretability. Since RBC exhibits turbulent behavior, we quantify uncertainty using a conformal prediction framework. This model replicates key physical features of RBC dynamics while significantly reducing computational cost, offering a scalable alternative to DNS for long-term simulations.
♻ ☆ Neural Graduated Assignment for Maximum Common Edge Subgraphs
The Maximum Common Edge Subgraph (MCES) problem is a crucial challenge with significant implications in domains such as biology and chemistry. Traditional approaches, which include transformations into max-clique and search-based algorithms, suffer from scalability issues when dealing with larger instances. This paper introduces ``Neural Graduated Assignment'' (NGA), a simple, scalable, unsupervised-training-based method that addresses these limitations. Central to NGA is stacking of differentiable assignment optimization with neural components, enabling high-dimensional parameterization of the matching process through a learnable temperature mechanism. We further theoretically analyze the learning dynamics of NGA, showing its design leads to fast convergence, better exploration-exploitation tradeoff, and ability to escape local optima. Extensive experiments across MCES computation, graph similarity estimation, and graph retrieval tasks reveal that NGA not only significantly improves computation time and scalability on large instances but also enhances performance compared to existing methodologies. The introduction of NGA marks a significant advancement in the computation of MCES and offers insights into other assignment problems.
♻ ☆ Mitigating Prior Errors in Causal Structure Learning: A Resilient Approach via Bayesian Networks
Causal structure learning (CSL), a prominent technique for encoding cause-and-effect relationships among variables, through Bayesian Networks (BNs). Although recovering causal structure solely from data is a challenge, the integration of prior knowledge, revealing partial structural truth, can markedly enhance learning quality. However, current methods based on prior knowledge exhibit limited resilience to errors in the prior, with hard constraint methods disregarding priors entirely, and soft constraints accepting priors based on a predetermined confidence level, which may require expert intervention. To address this issue, we propose a strategy resilient to edge-level prior errors for CSL, thereby minimizing human intervention. We classify prior errors into different types and provide their theoretical impact on the Structural Hamming Distance (SHD) under the presumption of sufficient data. Intriguingly, we discover and prove that the strong hazard of prior errors is associated with a unique acyclic closed structure, defined as ``quasi-circle''. Leveraging this insight, a post-hoc strategy is employed to identify the prior errors by its impact on the increment of ``quasi-circles''. Through empirical evaluation on both real and synthetic datasets, we demonstrate our strategy's robustness against prior errors. Specifically, we highlight its substantial ability to resist order-reversed errors while maintaining the majority of correct prior.
comment: 12 pages, 4 figures
♻ ☆ On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration
Learning from observations (LfO) replicates expert behavior without needing access to the expert's actions, making it more practical than learning from demonstrations (LfD) in many real-world scenarios. However, directly applying the on-policy training scheme in LfO worsens the sample inefficiency problem, while employing the traditional off-policy training scheme in LfO magnifies the instability issue. This paper seeks to develop an efficient and stable solution for the LfO problem. Specifically, we begin by exploring the generalization capabilities of both the reward function and policy in LfO, which provides a theoretical foundation for computation. Building on this, we modify the policy optimization method in generative adversarial imitation from observation (GAIfO) with distributional soft actor-critic (DSAC), and propose the Mimicking Observations through Distributional Update Learning with adequate Exploration (MODULE) algorithm to solve the LfO problem. MODULE incorporates the advantages of (1) high sample efficiency and training robustness enhancement in soft actor-critic (SAC), and (2) training stability in distributional reinforcement learning (RL). Extensive experiments in MuJoCo environments showcase the superior performance of MODULE over current LfO methods.
♻ ☆ Efficient Verified Machine Unlearning For Distillation NeurIPS 2025
Growing data privacy demands, driven by regulations like GDPR and CCPA, require machine unlearning methods capable of swiftly removing the influence of specific training points. Although verified approaches like SISA, using data slicing and checkpointing, achieve efficient unlearning for single models by reverting to intermediate states, these methods struggle in teacher-student knowledge distillation settings. Unlearning in the teacher typically forces costly, complete student retraining due to pervasive information propagation during distillation. Our primary contribution is PURGE (Partitioned Unlearning with Retraining Guarantee for Ensembles), a novel framework integrating verified unlearning with distillation. We introduce constituent mapping and an incremental multi-teacher strategy that partitions the distillation process, confines each teacher constituent's impact to distinct student data subsets, and crucially maintains data isolation. The PURGE framework substantially reduces retraining overhead, requiring only partial student updates when teacher-side unlearning occurs. We provide both theoretical analysis, quantifying significant speed-ups in the unlearning process, and empirical validation on multiple datasets, demonstrating that PURGE achieves these efficiency gains while maintaining student accuracy comparable to standard baselines.
comment: Accepted at The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Learning Confidence Bounds for Classification with Imbalanced Data
Class imbalance poses a significant challenge in classification tasks, where traditional approaches often lead to biased models and unreliable predictions. Undersampling and oversampling techniques have been commonly employed to address this issue, yet they suffer from inherent limitations stemming from their simplistic approach such as loss of information and additional biases respectively. In this paper, we propose a novel framework that leverages learning theory and concentration inequalities to overcome the shortcomings of traditional solutions. We focus on understanding the uncertainty in a class-dependent manner, as captured by confidence bounds that we directly embed into the learning process. By incorporating class-dependent estimates, our method can effectively adapt to the varying degrees of imbalance across different classes, resulting in more robust and reliable classification outcomes. We empirically show how our framework provides a promising direction for handling imbalanced data in classification tasks, offering practitioners a valuable tool for building more accurate and trustworthy models.
comment: Published at ECAI 2024 main track (Frontiers in Artificial Intelligence and Applications)
♻ ☆ Pretraining a Shared Q-Network for Data-Efficient Offline Reinforcement Learning
Offline reinforcement learning (RL) aims to learn a policy from a static dataset without further interactions with the environment. Collecting sufficiently large datasets for offline RL is exhausting since this data collection requires colossus interactions with environments and becomes tricky when the interaction with the environment is restricted. Hence, how an agent learns the best policy with a minimal static dataset is a crucial issue in offline RL, similar to the sample efficiency problem in online RL. In this paper, we propose a simple yet effective plug-and-play pretraining method to initialize a feature of a Q-network to enhance data efficiency in offline RL. Specifically, we introduce a shared Q-network structure that outputs predictions of the next state and Q-value. We pretrain the shared Q-network through a supervised regression task that predicts a next state and trains the shared Q-network using diverse offline RL methods. Through extensive experiments, we empirically demonstrate that our method enhances the performance of existing popular offline RL methods on the D4RL, Robomimic and V-D4RL benchmarks. Furthermore, we show that our method significantly boosts data-efficient offline RL across various data qualities and data distributions trough D4RL and ExoRL benchmarks. Notably, our method adapted with only 10% of the dataset outperforms standard algorithms even with full datasets.
♻ ☆ Beyond Benign Overfitting in Nadaraya-Watson Interpolators NeurIPS 2025
In recent years, there has been much interest in understanding the generalization behavior of interpolating predictors, which overfit on noisy training data. Whereas standard analyses are concerned with whether a method is consistent or not, recent observations have shown that even inconsistent predictors can generalize well. In this work, we revisit the classic interpolating Nadaraya-Watson (NW) estimator (also known as Shepard's method), and study its generalization capabilities through this modern viewpoint. In particular, by varying a single bandwidth-like hyperparameter, we prove the existence of multiple overfitting behaviors, ranging non-monotonically from catastrophic, through benign, to tempered. Our results highlight how even classical interpolating methods can exhibit intricate generalization behaviors. In addition, for the purpose of tuning the hyperparameter, the results suggest that over-estimating the intrinsic dimension of the data is less harmful than under-estimating it. Numerical experiments complement our theory, demonstrating the same phenomena.
comment: NeurIPS 2025
♻ ☆ DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding NeurIPS 2025
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
comment: Accepted by Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Sparse Explanations of Neural Networks Using Pruned Layer-Wise Relevance Propagation ECML
Explainability is a key component in many applications involving deep neural networks (DNNs). However, current explanation methods for DNNs commonly leave it to the human observer to distinguish relevant explanations from spurious noise. This is not feasible anymore when going from easily human-accessible data such as images to more complex data such as genome sequences. To facilitate the accessibility of DNN outputs from such complex data and to increase explainability, we present a modification of the widely used explanation method layer-wise relevance propagation. Our approach enforces sparsity directly by pruning the relevance propagation for the different layers. Thereby, we achieve sparser relevance attributions for the input features as well as for the intermediate layers. As the relevance propagation is input-specific, we aim to prune the relevance propagation rather than the underlying model architecture. This allows to prune different neurons for different inputs and hence, might be more appropriate to the local nature of explanation methods. To demonstrate the efficacy of our method, we evaluate it on two types of data: images and genome sequences. We show that our modification indeed leads to noise reduction and concentrates relevance on the most important features compared to the baseline.
comment: Presented at ECML PKDD 2024
♻ ☆ Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning ICML 2025
Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability -- often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLM's abilities to re-purpose parametric knowledge in novel settings.
comment: ICML 2025 Workshop on Scaling up Intervention Models
♻ ☆ Patent Language Model Pretraining with ModernBERT
Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain 3 domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference over - 3x that of PatentBERT - underscoring their suitability for time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.
comment: 7 pages, 5 figures, 4 tables
♻ ☆ LAMP-PRo: Label-aware Attention for Multi-label Prediction of DNA- and RNA-binding Proteins using Protein Language Models
Identifying DNA- (DBPs) and RNA-binding proteins (RBPs) is crucial for the understanding of cell function, molecular interactions as well as regulatory functions. Owing to their high similarity, most of the existing approaches face challenges in differentiating between DBPs and RBPs leading to high cross-prediction errors. Moreover, identifying proteins which bind to both DNA and RNA (DRBPs) is also quite a challenging task. In this regard, we propose a novel framework viz. LAMP-PRo which is based on pre-trained protein language model (PLM), attention mechanisms and multi-label learning to mitigate these issues. First, pre-trained PLM such ESM-2 is used for embedding the protein sequences followed by convolutional neural network (CNN). Subsequently multi-head self-attention mechanism is applied for the contextual information while label-aware attention is used to compute class-specific representations by attending to the sequence in a way that is tailored to each label (DBP, RBP and non-NABP) in a multi-label setup. We have also included a novel cross-label attention mechanism to explicitly capture dependencies between DNA- and RNA-binding proteins, enabling more accurate prediction of DRBP. Finally, a linear layer followed by a sigmoid function are used for the final prediction. Extensive experiments are carried out to compare LAMP-PRo with the existing methods wherein the proposed model shows consistent competent performance. Furthermore, we also provide visualization to showcase model interpretability, highlighting which parts of the sequence are most relevant for a predicted label. The original datasets are available at http://bliulab.net/iDRBP\_MMC and the codes are available at https://github.com/NimishaGhosh/LAMP-PRo.
♻ ☆ A representational framework for learning and encoding structurally enriched trajectories in complex agent environments
The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios. One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them in state-action transitions. Whereas such representations are procedurally efficient, they lack structural richness. To address this problem, we propose to enhance the agent's ontology and extend the traditional conceptualisation of trajectories to provide a more nuanced view of task execution. Structurally Enriched Trajectories (SETs) extend the encoding of sequences of states and their transitions by incorporating hierarchical relations between objects, interactions, and affordances. SETs are built as multi-level graphs, providing a detailed representation of the agent dynamics and a transferable functional abstraction of the task. SETs are integrated into an architecture, Structurally Enriched Trajectory Learning and Encoding (SETLE), that employs a heterogeneous graph-based memory structure of multi-level relational dependencies essential for generalisation. We demonstrate that SETLE can support downstream tasks, enabling agents to recognise task relevant structural patterns across CREATE and MiniGrid environments. Finally, we integrate SETLE with reinforcement learning and show measurable improvements in downstream performance, including breakthrough success rates in complex, sparse-reward tasks.
♻ ☆ ReVeal: Self-Evolving Code Agents via Reliable Self-Verification
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
♻ ☆ LENS: Large Pre-trained Transformer for Exploring Financial Time Series Regularities
Modeling large-scale time series has gained significant attention in recent years. However, its direct application in finance remains challenging due to substantial differences in data characteristics across domains. Specifically, financial systems feature inherent stochasticity and low signal-to-noise ratios, rendering traditional methods and pre-training approaches ineffective. This underscores the urgent need for a foundation model tailored to financial time series. To bridge this gap, we propose \textbf{LENS}, a pre-trained model for this domain. \textbf{LENS} effectively captures the complexity of financial stochastic systems through a carefully crafted model architecture and mitigates noise during pre-training by using an invertible embedding module. We provide a rigorous theoretical explanation of the model's effectiveness and validate its performance through extensive experiments. Pre-trained on a dataset comprising 100 billion financial observations, \textbf{LENS} achieves exceptional results across a wide range of critical downstream tasks. Moreover, our work offers practical insights into developing pre-trained time series models in high-noise environments, paving the way for further advancements in this pivotal research domain.
♻ ☆ Molecular Fingerprints Are Strong Models for Peptide Function Prediction
Understanding peptide properties is often assumed to require modeling long-range molecular interactions, motivating the use of complex graph neural networks and pretrained transformers. Yet, whether such long-range dependencies are essential remains unclear. We investigate if simple, domain-specific molecular fingerprints can capture peptide function without these assumptions. Atomic-level representation aims to provide richer information than purely sequence-based models and better efficiency than structural ones. Across 132 datasets, including LRGB and five other peptide benchmarks, models using count-based ECFP, Topological Torsion, and RDKit fingerprints with LightGBM achieve state-of-the-art accuracy. Despite encoding only short-range molecular features, these models outperform GNNs and transformer-based approaches. Control experiments with sequence shuffling and amino acid counts confirm that fingerprints, though inherently local, suffice for robust peptide property prediction. Our results challenge the presumed necessity of long-range interaction modeling and highlight molecular fingerprints as efficient, interpretable, and computationally lightweight alternatives for peptide prediction.
♻ ☆ Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation NeurIPS 2025
Diffusion-based inverse algorithms have shown remarkable performance across various inverse problems, yet their reliance on numerous denoising steps incurs high computational costs. While recent developments of fast diffusion ODE solvers offer effective acceleration for diffusion sampling without observations, their application in inverse problems remains limited due to the heterogeneous formulations of inverse algorithms and their prevalent use of approximations and heuristics, which often introduce significant errors that undermine the reliability of analytical solvers. In this work, we begin with an analysis of ODE solvers for inverse problems that reveals a linear combination structure of approximations for the inverse trajectory. Building on this insight, we propose a canonical form that unifies a broad class of diffusion-based inverse algorithms and facilitates the design of more generalizable solvers. Inspired by the linear subspace search strategy, we propose Learnable Linear Extrapolation (LLE), a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm conforming to our canonical form. LLE optimizes the combination coefficients to refine current predictions using previous estimates, alleviating the sensitivity of analytical solvers for inverse algorithms. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
comment: Accepted by NeurIPS 2025
♻ ☆ Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
comment: 10 pages, 3 figures. v2 adds discussion of related work (G-Pass@k)
♻ ☆ S-CFE: Simple Counterfactual Explanations
We study the problem of finding optimal sparse, manifold-aligned counterfactual explanations for classifiers. Canonically, this can be formulated as an optimization problem with multiple non-convex components, including classifier loss functions and manifold alignment (or \emph{plausibility}) metrics. The added complexity of enforcing \emph{sparsity}, or shorter explanations, complicates the problem further. Existing methods often focus on specific models and plausibility measures, relying on convex $\ell_1$ regularizers to enforce sparsity. In this paper, we tackle the canonical formulation using the accelerated proximal gradient (APG) method, a simple yet efficient first-order procedure capable of handling smooth non-convex objectives and non-smooth $\ell_p$ (where $0 \leq p < 1$) regularizers. This enables our approach to seamlessly incorporate various classifiers and plausibility measures while producing sparser solutions. Our algorithm only requires differentiable data-manifold regularizers and supports box constraints for bounded feature ranges, ensuring the generated counterfactuals remain \emph{actionable}. Finally, experiments on real-world datasets demonstrate that our approach effectively produces sparse, manifold-aligned counterfactual explanations while maintaining proximity to the factual data and computational efficiency.
♻ ☆ Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking
Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages applied consecutively to refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance the prediction quality through a dedicated scoring model and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior performance on Astex and PDBbind test sets in terms of docking success rate and physical plausibility. Moreover, our method works approximately 25 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.
♻ ☆ A surrogate model for topology optimisation of elastic structures via parametric autoencoders
A surrogate-based topology optimisation algorithm for linear elastic structures under parametric loads and boundary conditions is proposed. Instead of learning the parametric solution of the state (and adjoint) problems or the optimisation trajectory as a function of the iterations, the proposed approach devises a surrogate version of the entire optimisation pipeline. First, the method predicts a quasi-optimal topology for a given problem configuration as a surrogate model of high-fidelity topologies optimised with the homogenisation method. This is achieved by means of a feed-forward net learning the mapping between the input parameters characterising the system setup and a latent space determined by encoder/decoder blocks reducing the dimensionality of the parametric topology optimisation problem and reconstructing a high-dimensional representation of the topology. Then, the predicted topology is used as an educated initial guess for a computationally efficient algorithm penalising the intermediate values of the design variable, while enforcing the governing equations of the system. This step allows the method to correct potential errors introduced by the surrogate model, eliminate artifacts, and refine the design in order to produce topologies consistent with the underlying physics. Different architectures are proposed and the approximation and generalisation capabilities of the resulting models are numerically evaluated. The quasi-optimal topologies allow to outperform the high-fidelity optimiser by reducing the average number of optimisation iterations by $53\%$ while achieving discrepancies below $4\%$ in the optimal value of the objective functional, even in the challenging scenario of testing the model to extrapolate beyond the training and validation domain.
comment: 43 pages, 13 figures, 7 tables
♻ ☆ Learning Fairer Representations with FairVIC
Mitigating bias in automated decision-making systems, particularly in deep learning models, is a critical challenge due to nuanced definitions of fairness, dataset-specific biases, and the inherent trade-off between fairness and accuracy. To address these issues, we introduce FairVIC, an innovative approach that enhances fairness in neural networks by integrating variance, invariance, and covariance terms into the loss function during training. Unlike methods that rely on predefined fairness criteria, FairVIC abstracts fairness concepts to minimise dependency on protected characteristics. We evaluate FairVIC against comparable bias mitigation techniques on benchmark datasets, considering both group and individual fairness, and conduct an ablation study on the accuracy-fairness trade-off. FairVIC demonstrates significant improvements ($\approx70\%$) in fairness across all tested metrics without compromising accuracy, thus offering a robust, generalisable solution for fair deep learning across diverse tasks and datasets.
♻ ☆ Airfoil optimization using Design-by-Morphing with minimized design-space dimensionality
Effective airfoil geometry optimization requires exploring a diverse range of designs using as few design variables as possible. This study introduces AirDbM, a Design-by-Morphing (DbM) approach specialized for airfoil optimization that systematically reduces design-space dimensionality. AirDbM selects an optimal set of 12 baseline airfoils from the UIUC airfoil database, which contains over 1,600 shapes, by sequentially adding the baseline that most increases the design capacity. With these baselines, AirDbM reconstructs 99 % of the database with a mean absolute error below 0.005, which matches the performance of a previous DbM approach that used more baselines. In multi-objective aerodynamic optimization, AirDbM demonstrates rapid convergence and achieves a Pareto front with a greater hypervolume than that of the previous larger-baseline study, where new Pareto-optimal solutions are discovered with enhanced lift-to-drag ratios at moderate stall tolerances. Furthermore, AirDbM demonstrates outstanding adaptability for reinforcement learning (RL) agents in generating airfoil geometry when compared to conventional airfoil parameterization methods, implying the broader potential of DbM in machine learning-driven design.
♻ ☆ Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation
View missing remains a significant challenge in graph-based multi-view semi-supervised learning, hindering their real-world applications. To address this issue, traditional methods introduce a missing indicator matrix and focus on mining partial structure among existing samples in each view for label propagation (LP). However, we argue that these disregarded missing samples sometimes induce discontinuous local structures, i.e., sub-clusters, breaking the fundamental smoothness assumption in LP. Consequently, such a Sub-Cluster Problem (SCP) would distort graph fusion and degrade classification performance. To alleviate SCP, we propose a novel incomplete multi-view semi-supervised learning method, termed AGF-TI. Firstly, we design an adversarial graph fusion scheme to learn a robust consensus graph against the distorted local structure through a min-max framework. By stacking all similarity matrices into a tensor, we further recover the incomplete structure from the high-order consistency information based on the low-rank tensor learning. Additionally, the anchor-based strategy is incorporated to reduce the computational complexity. An efficient alternative optimization algorithm combining a reduced gradient descent method is developed to solve the formulated objective, with theoretical convergence. Extensive experimental results on various datasets validate the superiority of our proposed AGF-TI as compared to state-of-the-art methods. Code is available at https://github.com/ZhangqiJiang07/AGF_TI.
comment: 31 pages, 15 figures
♻ ☆ MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization NeurIPS 2025
Meta-Black-Box Optimization (MetaBBO) streamlines the automation of optimization algorithm design through meta-learning. It typically employs a bi-level structure: the meta-level policy undergoes meta-training to reduce the manual effort required in developing algorithms for low-level optimization tasks. The original MetaBox (2023) provided the first open-source framework for reinforcement learning-based single-objective MetaBBO. However, its relatively narrow scope no longer keep pace with the swift advancement in this field. In this paper, we introduce MetaBox-v2 (https://github.com/MetaEvo/MetaBox) as a milestone upgrade with four novel features: 1) a unified architecture supporting RL, evolutionary, and gradient-based approaches, by which we reproduce $23$ up-to-date baselines; 2) efficient parallelization schemes, which reduce the training/testing time by $10-40$x; 3) a comprehensive benchmark suite of $18$ synthetic/realistic tasks ($1900$+ instances) spanning single-objective, multi-objective, multi-model, and multi-task optimization scenarios; 4) plentiful and extensible interfaces for custom analysis/visualization and integrating to external optimization tools/benchmarks. To show the utility of MetaBox-v2, we carry out a systematic case study that evaluates the built-in baselines in terms of the optimization performance, generalization ability and learning efficiency. Valuable insights are concluded from thorough and detailed analysis for practitioners and those new to the field.
comment: Accepted by NeurIPS 2025
♻ ☆ The Narcissus Hypothesis: Descending to the Rung of Illusion NeurIPS 2025
Modern foundational models increasingly reflect not just world knowledge, but patterns of human preference embedded in their training data. We hypothesize that recursive alignment-via human feedback and model-generated corpora-induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We refer to it as the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl's Ladder of Causality, culminating in what we refer to as the Rung of Illusion.
comment: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
♻ ☆ REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
♻ ☆ Training on Plausible Counterfactuals Removes Spurious Correlations
Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced \emph{incorrect} target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers achieve not only high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
♻ ☆ VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching NeurIPS 2025
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity in robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby circumventing redundant computations. Additionally, to maintain action precision, VLA-Cache selectively re-computes task-relevant tokens that are environmentally sensitive, ensuring the fidelity of critical visual information. To further optimize efficiency, we introduce a layer adaptive token reusing strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss on task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
comment: Accepted to NeurIPS 2025
♻ ☆ The Spacetime of Diffusion Models: An Information Geometry Perspective
We present a novel geometric perspective on the latent space of diffusion models. We first show that the standard pullback approach, utilizing the deterministic probability flow ODE decoder, is fundamentally flawed. It provably forces geodesics to decode as straight segments in data space, effectively ignoring any intrinsic data geometry beyond the ambient Euclidean space. Complementing this view, diffusion also admits a stochastic decoder via the reverse SDE, which enables an information geometric treatment with the Fisher-Rao metric. However, a choice of $x_T$ as the latent representation collapses this metric due to memorylessness. We address this by introducing a latent spacetime $z=(x_t,t)$ that indexes the family of denoising distributions $p(x_0 | x_t)$ across all noise scales, yielding a nontrivial geometric structure. We prove these distributions form an exponential family and derive simulation-free estimators for curve lengths, enabling efficient geodesic computation. The resulting structure induces a principled Diffusion Edit Distance, where geodesics trace minimal sequences of noise and denoise edits between data. We also demonstrate benefits for transition path sampling in molecular systems, including constrained variants such as low-variance transitions and region avoidance. Code is available at: https://github.com/rafalkarczewski/spacetime-geometry
♻ ☆ ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition NeurIPS 2025
Many critical applications, from autonomous scientific discovery to personalized medicine, demand systems that can both strategically acquire the most informative data and instantaneously perform inference based upon it. While amortized methods for Bayesian inference and experimental design offer part of the solution, neither approach is optimal in the most general and challenging task, where new data needs to be collected for instant inference. To tackle this issue, we introduce the Amortized Active Learning and Inference Engine (ALINE), a unified framework for amortized Bayesian inference and active data acquisition. ALINE leverages a transformer architecture trained via reinforcement learning with a reward based on self-estimated information gain provided by its own integrated inference component. This allows it to strategically query informative data points while simultaneously refining its predictions. Moreover, ALINE can selectively direct its querying strategy towards specific subsets of model parameters or designated predictive tasks, optimizing for posterior estimation, data prediction, or a mixture thereof. Empirical results on regression-based active learning, classical Bayesian experimental design benchmarks, and a psychometric model with selectively targeted parameters demonstrate that ALINE delivers both instant and accurate inference along with efficient selection of informative points.
comment: 27 pages, 13 figures. Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ A Flow-Based Model for Conditional and Probabilistic Electricity Consumption Profile Generation and Prediction
Residential Load Profile (RLP) generation and prediction are critical for the operation and planning of distribution networks, especially as diverse low-carbon technologies (e.g., photovoltaic and electric vehicles) are increasingly adopted. This paper introduces a novel flow-based generative model, termed Full Convolutional Profile Flow (FCPFlow), which is uniquely designed for both conditional and unconditional RLP generation, and for probabilistic load forecasting. By introducing two new layers--the invertible linear layer and the invertible normalization layer--the proposed FCPFlow architecture shows three main advantages compared to traditional statistical and contemporary deep generative models: 1) it is well-suited for RLP generation under continuous conditions, such as varying weather and annual electricity consumption, 2) it demonstrates superior scalability in different datasets compared to traditional statistical models, and 3) it also demonstrates better modeling capabilities in capturing the complex correlation of RLPs compared with deep generative models.
♻ ☆ Shuffling Heuristic in Variational Inequalities: Establishing New Convergence Guarantees
Variational inequalities have gained significant attention in machine learning and optimization research. While stochastic methods for solving these problems typically assume independent data sampling, we investigate an alternative approach -- the shuffling heuristic. This strategy involves permuting the dataset before sequential processing, ensuring equal consideration of all data points. Despite its practical utility, theoretical guarantees for shuffling in variational inequalities remain unexplored. We address this gap by providing the first theoretical convergence estimates for shuffling methods in this context. Our analysis establishes rigorous bounds and convergence rates, extending the theoretical framework for this important class of algorithms. We validate our findings through extensive experiments on diverse benchmark variational inequality problems, demonstrating faster convergence of shuffling methods compared to independent sampling approaches.
comment: 25 pages, 5 figures, 2 tables
♻ ☆ ProSh: Probabilistic Shielding for Model-free Reinforcement Learning
Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent's policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
♻ ☆ Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization
Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in the distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from the theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
comment: 58 pages, 5 figures, 5 tables
♻ ☆ Expressive Reward Synthesis with the Runtime Monitoring Language
A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Indeed, reward functions in RL are typically treated as black-box mappings from state-action pairs to scalar values. While effective in many settings, this approach provides no information about why rewards are given, which can hinder learning and interpretability. Reward Machines address this issue by representing reward functions as finite state automata, enabling the specification of structured, non-Markovian reward functions. However, their expressivity is typically bounded by regular languages, leaving them unable to capture more complex behaviours such as counting or parametrised conditions. In this work, we build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines. By leveraging the built-in memory of RML, our approach can specify reward functions for non-regular, non-Markovian tasks. We demonstrate the expressiveness of our approach through experiments, highlighting additional advantages in flexible event-handling and task specification over existing Reward Machine-based methods.
♻ ☆ MEET-Sepsis: Multi-Endogenous-View Enhanced Time-Series Representation Learning for Early Sepsis Prediction
Sepsis is a life-threatening infectious syndrome associated with high mortality in intensive care units (ICUs). Early and accurate sepsis prediction (SP) is critical for timely intervention, yet remains challenging due to subtle early manifestations and rapidly escalating mortality. While AI has improved SP efficiency, existing methods struggle to capture weak early temporal signals. This paper introduces a Multi-Endogenous-view Representation Enhancement (MERE) mechanism to construct enriched feature views, coupled with a Cascaded Dual-convolution Time-series Attention (CDTA) module for multi-scale temporal representation learning. The proposed MEET-Sepsis framework achieves competitive prediction accuracy using only 20% of the ICU monitoring time required by SOTA methods, significantly advancing early SP. Extensive validation confirms its efficacy. Code is available at: https://github.com/yueliangy/MEET-Sepsis.
comment: Accepted to PRICAI 2025
♻ ☆ FlashBias: Fast Computation of Attention with Bias
Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query-key scores, has been widely deployed in vision, language, protein-folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalizations. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for Pairformer in AlphaFold 3, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias.
♻ ☆ Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning NeurIPS2025
We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.
comment: NeurIPS2025
♻ ☆ DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training
Training multimodal process reward models (PRMs) is hard due to (i) distribution shift between training set and test set and (ii) quality imbalance across training data samples. While domain-level reweighting (e.g., DreamPRM) aligns training with test-time objectives, it leaves a clear gap to an oracle upper bound (pass@N), even under a "sanity check" that uses test set data to probe headroom -- pointing to meta-level under-parameterization. We introduce DreamPRM-1.5, an instance-level reweighting framework that assigns an adaptive weight to every training example via bi-level optimization. To realize instance reweighting across scales, we develop two complementary regimes: Instance Table, which learns explicit per-sample weights and excels on small/medium data, and Instance Net, a lightweight neural network that generalizes better and scales to large corpora. A practical, stable training recipe -- time-scale matching between upper/lower updates, cold-start initialization, and bounded-range weights -- prevents divergence. Integrated with test-time scaling, DreamPRM-1.5 attains 84.6 accuracy on the MMMU validation set, 31.3 accuracy on R-Bench-V and, when paired with a leading backbone (e.g., GPT-5-mini), achieves first-place results on public multimodal reasoning leaderboards. Moreover, extensive experiments, including benchmark evaluations, baseline comparisons, and a sanity check, demonstrate that DreamPRM-1.5 closes the gap toward the oracle, achieves leading performance, and trains stably.
♻ ☆ Asynchronous Federated Learning: A Scalable Approach for Decentralized Machine Learning
Federated Learning (FL) has emerged as a powerful paradigm for decentralized machine learning, enabling collaborative model training across diverse clients without sharing raw data. However, traditional FL approaches often face limitations in scalability and efficiency due to their reliance on synchronous client updates, which can result in significant delays and increased communication overhead, particularly in heterogeneous and dynamic environments. To address these challenges in this paper, we propose an Asynchronous Federated Learning (AFL) algorithm, which allows clients to update the global model independently and asynchronously. Our key contributions include a comprehensive convergence analysis of AFL in the presence of client delays and model staleness. By leveraging martingale difference sequence theory and variance bounds, we ensure robust convergence despite asynchronous updates. Assuming strongly convex local objective functions, we establish bounds on gradient variance under random client sampling and derive a recursion formula quantifying the impact of client delays on convergence. Furthermore, we demonstrate the practical applicability of the AFL algorithm by training decentralized linear regression and Support Vector Machine (SVM) based classifiers and compare its results with synchronous FL algorithm to effectively handling non-IID data distributed among clients. The proposed AFL algorithm addresses key limitations of traditional FL methods, such as inefficiency due to global synchronization and susceptibility to client drift. It enhances scalability, robustness, and efficiency in real-world settings with heterogeneous client populations and dynamic network conditions. Our results underscore the potential of AFL to drive advancements indistributed learning systems, particularly for large-scale, privacy-preserving applications in resource-constrained environments.
♻ ☆ Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning NeurIPS 2025
Symmetry is pervasive in robotics and has been widely exploited to improve sample efficiency in deep reinforcement learning (DRL). However, existing approaches primarily focus on spatial symmetries, such as reflection, rotation, and translation, while largely neglecting temporal symmetries. To address this gap, we explore time reversal symmetry, a form of temporal symmetry commonly found in robotics tasks such as door opening and closing. We propose Time Reversal symmetry enhanced Deep Reinforcement Learning (TR-DRL), a framework that combines trajectory reversal augmentation and time reversal guided reward shaping to efficiently solve temporally symmetric tasks. Our method generates reversed transitions from fully reversible transitions, identified by a proposed dynamics-consistent filter, to augment the training data. For partially reversible transitions, we apply reward shaping to guide learning, according to successful trajectories from the reversed task. Extensive experiments on the Robosuite and MetaWorld benchmarks demonstrate that TR-DRL is effective in both single-task and multi-task settings, achieving higher sample efficiency and stronger final performance compared to baseline methods.
comment: Accepted in NeurIPS 2025
♻ ☆ Fair Minimum Labeling: Efficient Temporal Network Activations for Reachability and Equity NeurIPS 2025
Balancing resource efficiency and fairness is critical in networked systems that support modern learning applications. We introduce the Fair Minimum Labeling (FML) problem: the task of designing a minimum-cost temporal edge activation plan that ensures each group of nodes in a network has sufficient access to a designated target set, according to specified coverage requirements. FML captures key trade-offs in systems where edge activations incur resource costs and equitable access is essential, such as distributed data collection, update dissemination in edge-cloud systems, and fair service restoration in critical infrastructure. We show that FML is NP-hard and $\Omega(\log |V|)$-hard to approximate, where $V$ is the set of nodes, and we present probabilistic approximation algorithms that match this bound, achieving the best possible guarantee for the activation cost. We demonstrate the practical utility of FML in a fair multi-source data aggregation task for training a shared model. Empirical results show that FML enforces group-level fairness with substantially lower activation cost than baseline heuristics, underscoring its potential for building resource-efficient, equitable temporal reachability in learning-integrated networks.
comment: Accepted at NeurIPS 2025
♻ ☆ Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples NeurIPS 2025
Sample selection is a prevalent approach in learning with noisy labels, aiming to identify confident samples for training. Although existing sample selection methods have achieved decent results by reducing the noise rate of the selected subset, they often overlook that not all mislabeled examples harm the model's performance equally. In this paper, we demonstrate that mislabeled examples correctly predicted by the model early in the training process are particularly harmful to model performance. We refer to these examples as Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting, which introduces a recalibration step that employs the model's later training state to re-select the confident subset identified early in training, thereby avoiding misleading confidence from early learning and effectively filtering out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that our method effectively improves sample selection and model performance by reducing MEEs.
comment: Accepted by NeurIPS 2025
♻ ☆ Model-based Implicit Neural Representation for sub-wavelength Radio Localization
The increasing deployment of large antenna arrays at base stations has significantly improved the spatial resolution and localization accuracy of radio-localization methods. However, traditional signal processing techniques struggle in complex radio environments, particularly in scenarios dominated by non line of sight (NLoS) propagation paths, resulting in degraded localization accuracy. Recent developments in machine learning have facilitated the development of machine learning-assisted localization techniques, enhancing localization accuracy in complex radio environments. However, these methods often involve substantial computational complexity during both the training and inference phases. This work extends the well-established fingerprinting-based localization framework by simultaneously reducing its memory requirements and improving its accuracy. Specifically, a model-based neural network is used to learn the location-to-channel mapping, and then serves as a generative neural channel model. This generative model augments the fingerprinting comparison dictionary while reducing the memory requirements. The proposed method outperforms fingerprinting baselines by achieving sub-wavelength localization accuracy, even in complex static NLoS environments. Remarkably, it offers an improvement by several orders of magnitude in localization accuracy, while simultaneously reducing memory requirements by an order of magnitude compared to classical fingerprinting methods.
♻ ☆ Explaining Large Language Models with gSMILE
Large Language Models (LLMs) such as GPT, LLaMA, and Claude achieve remarkable performance in text generation but remain opaque in their decision-making processes, limiting trust and accountability in high-stakes applications. We present gSMILE (generative SMILE), a model-agnostic, perturbation-based framework for token-level interpretability in LLMs. Extending the SMILE methodology, gSMILE uses controlled prompt perturbations, Wasserstein distance metrics, and weighted linear surrogates to identify input tokens with the most significant impact on the output. This process enables the generation of intuitive heatmaps that visually highlight influential tokens and reasoning paths. We evaluate gSMILE across leading LLMs (OpenAI's gpt-3.5-turbo-instruct, Meta's LLaMA 3.1 Instruct Turbo, and Anthropic's Claude 2.1) using attribution fidelity, attribution consistency, attribution stability, attribution faithfulness, and attribution accuracy as metrics. Results show that gSMILE delivers reliable human-aligned attributions, with Claude 2.1 excelling in attention fidelity and GPT-3.5 achieving the highest output consistency. These findings demonstrate gSMILE's ability to balance model performance and interpretability, enabling more transparent and trustworthy AI systems.
♻ ☆ PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning
Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: First, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: The former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency - both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE) which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows that of the best-known rate in MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for the federated training in various scenarios.
♻ ☆ Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose an alternative, GPU native bilinear operator to MatMuls in neural networks, which offers a three-way tradeoff between: speed, accuracy and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), yet increases the parameter count compared to MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied on tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random SGD initialization. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve Imagenet-1K accuracy of SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounds, suggest STL as a promising building block for scalable and cost-efficient AI.
♻ ☆ Low-cost Embedded Breathing Rate Determination Using 802.15.4z IR-UWB Hardware for Remote Healthcare
Respiratory diseases account for a significant portion of global mortality. Affordable and early detection is an effective way of addressing these ailments. To this end, a low-cost commercial off-the-shelf (COTS), IEEE 802.15.4z standard compliant impulse-radio ultra-wideband (IR-UWB) radar system is used to estimate human respiration rates. We propose a convolutional neural network (CNN) specifically adapted to predict breathing rates from ultra-wideband (UWB) channel impulse response (CIR) data, and compare its performance with both other rule-based algorithms and model-based solutions. The study uses a diverse dataset, incorporating various real-life environments to evaluate system robustness. To facilitate future research, this dataset will be released as open source. Results show that the CNN achieves a mean absolute error (MAE) of 1.73 breaths per minute (BPM) in unseen situations, significantly outperforming rule-based methods (3.40 BPM). By incorporating calibration data from other individuals in the unseen situations, the error is further reduced to 0.84 BPM. In addition, this work evaluates the feasibility of running the pipeline on a low-cost embedded device. Applying 8-bit quantization to both the weights and input/ouput tensors, reduces memory requirements by 67% and inference time by 64% with only a 3% increase in MAE. As a result, we show it is feasible to deploy the algorithm on an nRF52840 system-on-chip (SoC) requiring only 46 KB of memory and operating with an inference time of only 192 ms. Once deployed, an analytical energy model estimates that the system, while continuously monitoring the room, can operate for up to 268 days without recharging when powered by a 20 000 mAh battery pack. For breathing monitoring in bed, the sampling rate can be lowered, extending battery life to 313 days, making the solution highly efficient for real-world, low-cost deployments.
comment: This paper has been submitted to IEEE Sensors Journal and is currently undergoing review
♻ ☆ Estimating Model Performance Under Covariate Shift Without Labels
After deployment, machine learning models often experience performance degradation due to shifts in data distribution. It is challenging to assess post-deployment performance accurately when labels are missing or delayed. Existing proxy methods, such as data drift detection, fail to measure the effects of these shifts adequately. To address this, we introduce a new method for evaluating binary classification models on unlabeled tabular data that accurately estimates model performance under covariate shift and call it Probabilistic Adaptive Performance Estimation (PAPE). It can be applied to any performance metric defined with elements of the confusion matrix. Crucially, PAPE operates independently of the original model, relying only on its predictions and probability estimates, and does not need any assumptions about the nature of covariate shift, learning directly from data instead. We tested PAPE using over 900 dataset-model combinations from US census data, assessing its performance against several benchmarks through various metrics. Our findings show that PAPE outperforms other methodologies, making it a superior choice for estimating the performance of binary classification models.
comment: 25 content pages, 7 figures
♻ ☆ SpecExit: Accelerating Large Reasoning Model via Speculative Exit
Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66\% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.
♻ ☆ gen2seg: Generative Models Enable Generalizable Instance Segmentation
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
comment: Website: https://reachomk.github.io/gen2seg/
♻ ☆ Dynamic Diffusion Schrödinger Bridge in Astrophysical Observational Inversions NeurIPS 2025
We study Diffusion Schr\"odinger Bridge (DSB) models in the context of dynamical astrophysical systems, specifically tackling observational inverse prediction tasks within Giant Molecular Clouds (GMCs) for star formation. We introduce the Astro-DSB model, a variant of DSB with the pairwise domain assumption tailored for astrophysical dynamics. By investigating its learning process and prediction performance in both physically simulated data and in real observations (the Taurus B213 data), we present two main takeaways. First, from the astrophysical perspective, our proposed paired DSB method improves interpretability, learning efficiency, and prediction performance over conventional astrostatistical and other machine learning methods. Second, from the generative modeling perspective, probabilistic generative modeling reveals improvements over discriminative pixel-to-pixel modeling in Out-Of-Distribution (OOD) testing cases of physical simulations with unseen initial conditions and different dominant physical processes. Our study expands research into diffusion models beyond the traditional visual synthesis application and provides evidence of the models' learning abilities beyond pure data statistics, paving a path for future physics-aware generative models which can align dynamics between machine learning and real (astro)physical systems.
comment: Accepted to NeurIPS 2025. Code available at https://github.com/L-YeZhu/AstroDSB
♻ ☆ Learning to Interpret Weight Differences in Language Models
Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes ("weight diffs") are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT-adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.
comment: Project code and links to weight diffs, adapters, and training data can be found at https://github.com/Aviously/diff-interpretation-tuning
♻ ☆ Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models NeurIPS 2025
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
comment: Accepted at NeurIPS 2025
♻ ☆ Preference-driven Knowledge Distillation for Few-shot Node Classification NeurIPS 2025
Graph neural networks (GNNs) can efficiently process text-attributed graphs (TAGs) due to their message-passing mechanisms, but their training heavily relies on the human-annotated labels. Moreover, the complex and diverse local topologies of nodes of real-world TAGs make it challenging for a single mechanism to handle. Large language models (LLMs) perform well in zero-/few-shot learning on TAGs but suffer from a scalability challenge. Therefore, we propose a preference-driven knowledge distillation (PKD) framework to synergize the complementary strengths of LLMs and various GNNs for few-shot node classification. Specifically, we develop a GNN-preference-driven node selector that effectively promotes prediction distillation from LLMs to teacher GNNs. To further tackle nodes' intricate local topologies, we develop a node-preference-driven GNN selector that identifies the most suitable teacher GNN for each node, thereby facilitating tailored knowledge distillation from teacher GNNs to the student GNN. Extensive experiments validate the efficacy of our proposed framework in few-shot node classification on real-world TAGs. Our code is available.
comment: Accepted by NeurIPS 2025
♻ ☆ Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLM.
comment: Project page at: https://xenozlh.github.io/Shuffle-R1/
♻ ☆ Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion
Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model's robustness to hyperparameter variations.
♻ ☆ Graph Neural Networks for Road Safety Modeling: Datasets and Evaluations for Accident Analysis NeurIPS 2023
We consider the problem of traffic accident analysis on a road network based on road network connections and traffic volume. Previous works have designed various deep-learning methods using historical records to predict traffic accident occurrences. However, there is a lack of consensus on how accurate existing methods are, and a fundamental issue is the lack of public accident datasets for comprehensive evaluations. This paper constructs a large-scale, unified dataset of traffic accident records from official reports of various states in the US, totaling 9 million records, accompanied by road networks and traffic volume reports. Using this new dataset, we evaluate existing deep-learning methods for predicting the occurrence of accidents on road networks. Our main finding is that graph neural networks such as GraphSAGE can accurately predict the number of accidents on roads with less than 22% mean absolute error (relative to the actual count) and whether an accident will occur or not with over 87% AUROC, averaged over states. We achieve these results by using multitask learning to account for cross-state variabilities (e.g., availability of accident labels) and transfer learning to combine traffic volume with accident prediction. Ablation studies highlight the importance of road graph-structural features, amongst other features. Lastly, we discuss the implications of the analysis and develop a package for easily using our new dataset.
comment: 22 pages. Appeared in NeurIPS 2023
♻ ☆ Class-wise Balancing Data Replay for Federated Class-Incremental Learning NeurIPS'25
Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior task in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) Subsequently, to handle class imbalance across tasks, the task aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model's overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
comment: NeurIPS'25 Accepted, Oral
♻ ☆ Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework's first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate the capabilities of embeddings to distinguish between examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting, underperforming even remarkably simple embeddings that are extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
♻ ☆ Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization CIKM 2025
We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.
comment: 8 pages, 3 figures, Accepted at CIKM 2025 FinAI Workshop
♻ ☆ Measuring the Measures: Discriminative Capacity of Representational Similarity Metrics Across Model Families
Representational similarity metrics are fundamental tools in neuroscience and AI, yet we lack systematic comparisons of their discriminative power across model families. We introduce a quantitative framework to evaluate representational similarity measures based on their ability to separate model families-across architectures (CNNs, Vision Transformers, Swin Transformers, ConvNeXt) and training regimes (supervised vs. self-supervised). Using three complementary separability measures-dprime from signal detection theory, silhouette coefficients and ROC-AUC, we systematically assess the discriminative capacity of commonly used metrics including RSA, linear predictivity, Procrustes, and soft matching. We show that separability systematically increases as metrics impose more stringent alignment constraints. Among mapping-based approaches, soft-matching achieves the highest separability, followed by Procrustes alignment and linear predictivity. Non-fitting methods such as RSA also yield strong separability across families. These results provide the first systematic comparison of similarity metrics through a separability lens, clarifying their relative sensitivity and guiding metric choice for large-scale model and brain comparisons.
♻ ☆ Harnessing Test-time Adaptation for NLU tasks Involving Dialects of English
Test-time domain adaptation (TTDA) is an excellent method which helps generalize models across domains, tasks, and distributions without the use of labeled datasets. Thus, TTDA is very useful in natural language processing (NLP) in the dialectal setting, since oftentimes, models are trained on Standard American English (SAE), evaluated on Indian English (IndE), Singaporean English (SingE), or Nigerian English (NgE), of which distribution differs significantly from the former. This is especially useful since dialectal datasets are scarce. In this paper, we explore one of the most famous TTDA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also theoretically propose the concept of dialectal gap and show that it has a positive correlation with the effectiveness of SHOT. We also find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data.
♻ ☆ Can We Validate Counterfactual Estimations in the Presence of General Network Interference?
Randomized experiments have become a cornerstone of evidence-based decision-making in contexts ranging from online platforms to public health. However, in experimental settings with network interference, a unit's treatment can influence outcomes of other units, challenging both causal effect estimation and its validation. Classic validation approaches fail as outcomes are only observable under a single treatment scenario and exhibit complex correlation patterns due to interference. To address these challenges, we introduce a framework that facilitates the use of machine learning tools for both estimation and validation in causal inference. Central to our approach is the new distribution-preserving network bootstrap, a theoretically-grounded technique that generates multiple statistically-valid subpopulations from a single experiment's data. This amplification of experimental samples enables our second contribution: a counterfactual cross-validation procedure. This procedure adapts the principles of model validation to the unique constraints of causal settings, providing a rigorous, data-driven method for selecting and evaluating estimators. We extend recent causal message-passing developments by incorporating heterogeneous unit-level characteristics and varying local interactions, ensuring reliable finite-sample performance through non-asymptotic analysis. Additionally, we develop and publicly release a comprehensive benchmark toolbox featuring diverse experimental environments, from networks of interacting AI agents to ride-sharing applications. These environments provide known ground truth values while maintaining realistic complexities, enabling systematic evaluation of causal inference methods. Extensive testing across these environments demonstrates our method's robustness to diverse forms of network interference.
♻ ☆ Learning from Mistakes: Enhancing Harmful Meme Detection via Misjudgment Risk Patterns
Internet memes have emerged as a popular multimodal medium, yet they are increasingly weaponized to convey harmful opinions through subtle rhetorical devices like irony and metaphor. Existing detection approaches, including MLLM-based techniques, struggle with these implicit expressions, leading to frequent misjudgments. This paper introduces PatMD, a novel approach that improves harmful meme detection by learning from and proactively mitigating these potential misjudgment risks. Our core idea is to move beyond superficial content-level matching and instead identify the underlying misjudgment risk patterns, proactively guiding the MLLMs to avoid known misjudgment pitfalls. We first construct a knowledge base where each meme is deconstructed into a misjudgment risk pattern explaining why it might be misjudged, either overlooking harmful undertones (false negative) or overinterpreting benign content (false positive). For a given target meme, PatMD retrieves relevant patterns and utilizes them to dynamically guide the MLLM's reasoning. Experiments on a benchmark of 6,626 memes across 5 harmful detection tasks show that PatMD outperforms state-of-the-art baselines, achieving an average of 8.30\% improvement in F1-score and 7.71\% improvement in accuracy, demonstrating strong generalizability and improved detection capability of harmful memes.
comment: The paper has something wrong and need to be corrected
♻ ☆ Fair Supervised Learning Through Constraints on Smooth Nonconvex Unfairness-Measure Surrogates
A new strategy for fair supervised machine learning is proposed. The main advantages of the proposed strategy as compared to others in the literature are as follows. (a) We introduce a new smooth nonconvex surrogate to approximate the Heaviside functions involved in discontinuous unfairness measures. The surrogate is based on smoothing methods from the optimization literature, and is new for the fair supervised learning literature. The surrogate is a tight approximation which ensures the trained prediction models are fair, as opposed to other (e.g., convex) surrogates that can fail to lead to a fair prediction model in practice. (b) Rather than rely on regularizers (that lead to optimization problems that are difficult to solve) and corresponding regularization parameters (that can be expensive to tune), we propose a strategy that employs hard constraints so that specific tolerances for unfairness can be enforced without the complications associated with the use of regularization. (c) Our proposed strategy readily allows for constraints on multiple (potentially conflicting) unfairness measures at the same time. Multiple measures can be considered with a regularization approach, but at the cost of having even more difficult optimization problems to solve and further expense for tuning. By contrast, through hard constraints, our strategy leads to optimization models that can be solved tractably with minimal tuning.
♻ ☆ Noise-Robustness Through Noise: A Framework combining Asymmetric LoRA with Poisoning MoE NeurIPS 2025
Current parameter-efficient fine-tuning methods for adapting pre-trained language models to downstream tasks are susceptible to interference from noisy data. Conventional noise-handling approaches either rely on laborious data pre-processing or employ model architecture modifications prone to error accumulation. In contrast to existing noise-process paradigms, we propose a noise-robust adaptation method via asymmetric LoRA poisoning experts (LoPE), a novel framework that enhances model robustness to noise only with generated noisy data. Drawing inspiration from the mixture-of-experts architecture, LoPE strategically integrates a dedicated poisoning expert in an asymmetric LoRA configuration. Through a two-stage paradigm, LoPE performs noise injection on the poisoning expert during fine-tuning to enhance its noise discrimination and processing ability. During inference, we selectively mask the dedicated poisoning expert to leverage purified knowledge acquired by normal experts for noise-robust output. Extensive experiments demonstrate that LoPE achieves strong performance and robustness purely through the low-cost noise injection, which completely eliminates the requirement of data cleaning.
comment: Accecpted to NeurIPS 2025
♻ ☆ TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding
We propose TPP-SD, a novel approach that accelerates Transformer temporal point process (TPP) sampling by adapting speculative decoding (SD) techniques from language models. By identifying the structural similarities between thinning algorithms for TPPs and speculative decoding for language models, we develop an efficient sampling framework that leverages a smaller draft model to generate multiple candidate events, which are then verified by the larger target model in parallel. TPP-SD maintains the same output distribution as autoregressive sampling while achieving significant acceleration. Experiments on both synthetic and real datasets demonstrate that our approach produces samples from identical distributions as standard methods, but with 2-6$\times$ speedup. Our ablation studies analyze the impact of hyperparameters such as draft length and draft model size on sampling efficiency. TPP-SD bridges the gap between powerful Transformer TPP models and the practical need for rapid sequence sampling.
♻ ☆ In-Context Learning of Linear Dynamical Systems with Transformers: Approximation Bounds and Depth-Separation NeurIPS 2025
This paper investigates approximation-theoretic aspects of the in-context learning capability of the transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable with those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
comment: NeurIPS 2025 camera ready version. Slight change to title and author order from earlier version. Added experiments and acknowledgements section
♻ ☆ Understanding Differential Transformer Unchains Pretrained Self-Attentions NeurIPS 2025
Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, Differential Transformer architecture demands large-scale training from scratch, hindering utilization of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a lightweight differential operation on the output value matrix, DEX effectively incorporates the key advantages of differential attention while remaining lightweight in both training and inference. Evaluations confirm that DEX substantially improves the pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01%).
comment: NeurIPS 2025
♻ ☆ Disaster Management in the Era of Agentic AI Systems: A Vision for Collective Human-Machine Intelligence for Augmented Resilience
The escalating frequency and severity of disasters routinely overwhelm traditional response capabilities, exposing critical vulnerability in disaster management. Current practices are hindered by fragmented data streams, siloed technologies, resource constraints, and the erosion of institutional memory, which collectively impede timely and effective decision making. This study introduces Disaster Copilot, a vision for a multi-agent artificial intelligence system designed to overcome these systemic challenges by unifying specialized AI tools within a collaborative framework. The proposed architecture utilizes a central orchestrator to coordinate diverse sub-agents, each specializing in critical domains such as predictive risk analytics, situational awareness, and impact assessment. By integrating multi-modal data, the system delivers a holistic, real-time operational picture and serve as the essential AI backbone required to advance Disaster Digital Twins from passive models to active, intelligent environments. Furthermore, it ensures functionality in resource-limited environments through on-device orchestration and incorporates mechanisms to capture institutional knowledge, mitigating the impact of staff turnover. We detail the system architecture and propose a three-phased roadmap emphasizing the parallel growth of technology, organizational capacity, and human-AI teaming. Disaster Copilot offers a transformative vision, fostering collective human-machine intelligence to build more adaptive, data-driven and resilient communities.
♻ ☆ A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications NeurIPS 2025
The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
comment: This paper is accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization
Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) yet often fails to maintain comparable performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Although global pruning aims to identify an optimal sparse model, intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose T\'yr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that T\'yr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.
♻ ☆ ProtoTS: Learning Hierarchical Prototypes for Explainable Time Series Forecasting
While deep learning has achieved impressive performance in time series forecasting, it becomes increasingly crucial to understand its decision-making process for building trust in high-stakes scenarios. Existing interpretable models often provide only local and partial explanations, lacking the capability to reveal how heterogeneous and interacting input variables jointly shape the overall temporal patterns in the forecast curve. We propose ProtoTS, a novel interpretable forecasting framework that achieves both high accuracy and transparent decision-making through modeling prototypical temporal patterns. ProtoTS computes instance-prototype similarity based on a denoised representation that preserves abundant heterogeneous information. The prototypes are organized hierarchically to capture global temporal patterns with coarse prototypes while capturing finer-grained local variations with detailed prototypes, enabling expert steering and multi-level interpretability. Experiments on multiple realistic benchmarks, including a newly released LOF dataset, show that ProtoTS not only exceeds existing methods in forecast accuracy but also delivers expert-steerable interpretations for better model understanding and decision support.
comment: Under submission
♻ ☆ Kernel-Based Nonparametric Tests For Shape Constraints
We develop a reproducing kernel Hilbert space (RKHS) framework for nonparametric mean-variance optimization and inference on shape constraints of the optimal rule. We derive statistical properties of the sample estimator and provide rigorous theoretical guarantees, such as asymptotic consistency, a functional central limit theorem, and a finite-sample deviation bound that matches the Monte Carlo rate up to regularization. Building on these findings, we introduce a joint Wald-type statistic to test for shape constraints over finite grids. The approach comes with an efficient computational procedure based on a pivoted Cholesky factorization, facilitating scalability to large datasets. Empirical tests suggest favorably of the proposed methodology.
comment: 32 pages, 1 figure
♻ ☆ Implicit Neural Compression of Point Clouds
Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.
♻ ☆ Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference NeurIPS'25
We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at a rate that is nearly independent of explicit dimension dependence. Specifically, for a $d$-dimensional strongly log-concave and log-smooth target, the number of iterations for BBVI with a sub-Gaussian family to obtain a solution $\epsilon$-close to the global optimum has a dimension dependence of $\mathrm{O}(\log d)$. This is a significant improvement over the $\mathrm{O}(d)$ dependence of full-rank location-scale families. For heavy-tailed families, we prove a weaker $\mathrm{O}(d^{2/k})$ dependence, where $k$ is the number of finite moments of the family. Additionally, if the Hessian of the target log-density is constant, the complexity is free of any explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.
comment: Accepted to NeurIPS'25
♻ ☆ Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method termed Social Choice Language Model for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
♻ ☆ Generalized Principal-Agent Problem with a Learning Agent ICLR 2025
In classic principal-agent problems such as Stackelberg games, contract design, and Bayesian persuasion, the agent best responds to the principal's committed strategy. We study repeated generalized principal-agent problems under the assumption that the principal does not have commitment power and the agent uses algorithms to learn to respond to the principal. We reduce this problem to a one-shot problem where the agent approximately best responds, and prove that: (1) If the agent uses contextual no-regret learning algorithms with regret $\mathrm{Reg}(T)$, then the principal can guarantee utility at least $U^* - \Theta\big(\sqrt{\tfrac{\mathrm{Reg}(T)}{T}}\big)$, where $U^*$ is the principal's optimal utility in the classic model with a best-responding agent. (2) If the agent uses contextual no-swap-regret learning algorithms with swap-regret $\mathrm{SReg}(T)$, then the principal cannot obtain utility more than $U^* + O(\frac{\mathrm{SReg(T)}}{T})$. (3) In addition, if the agent uses mean-based learning algorithms (which can be no-regret but not no-swap-regret), then the principal can sometimes do significantly better than $U^*$. These results not only refine previous works on Stackelberg games and contract design, but also lead to new results for Bayesian persuasion with a learning agent and all generalized principal-agent problems where the agent does not have private information.
comment: A short version of this work appeared on ICLR 2025 (spotlight). This full version has been accepted by Quantitative Economics
Computer Vision and Pattern Recognition 200
☆ ConsistEdit: Highly Consistent and Precise Training-free Visual Editing SIGGRAPH
Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcraft, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.
comment: SIGGRAPH Asia 2025
☆ Glyph: Scaling Context Windows via Visual-Text Compression
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
☆ UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
☆ Botany-Bot: Digital Twin Monitoring of Occluded and Underleaf Plant Structures with Gaussian Splats
Commercial plant phenotyping systems using fixed cameras cannot perceive many plant details due to leaf occlusion. In this paper, we present Botany-Bot, a system for building detailed "annotated digital twins" of living plants using two stereo cameras, a digital turntable inside a lightbox, an industrial robot arm, and 3D segmentated Gaussian Splat models. We also present robot algorithms for manipulating leaves to take high-resolution indexable images of occluded details such as stem buds and the underside/topside of leaves. Results from experiments suggest that Botany-Bot can segment leaves with 90.8% accuracy, detect leaves with 86.2% accuracy, lift/push leaves with 77.9% accuracy, and take detailed overside/underside images with 77.3% accuracy. Code, videos, and datasets are available at https://berkeleyautomation.github.io/Botany-Bot/.
comment: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
☆ SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.
☆ Towards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion
Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as "black boxes," limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders-one on the original image and another on the segmented lesion whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model's reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model's predictions are based on the lesion area, unlike models that rely on spurious background features. These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.
comment: 15 pages, 7 Figures, 3 Tables
☆ Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term ``seeing but not believing'' that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
comment: 21 pages, 10 figures, 6 tables
☆ VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
comment: 18 pages, 7 Figures,
☆ Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition
We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incur extensive computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics with limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.
comment: 13 pages
☆ Can Image-To-Video Models Simulate Pedestrian Dynamics? ICML 2025
Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.
comment: Appeared in the ICML 2025 Workshop on Building Physically Plausible World Models, July 2025, https://physical-world-modeling.github.io/
☆ Signature Forgery Detection: Improving Cross-Dataset Generalization
Automated signature verification is a critical biometric technique used in banking, identity authentication, and legal documentation. Despite the notable progress achieved by deep learning methods, most approaches in offline signature verification still struggle to generalize across datasets, as variations in handwriting styles and acquisition protocols often degrade performance. This study investigates feature learning strategies for signature forgery detection, focusing on improving cross-dataset generalization -- that is, model robustness when trained on one dataset and tested on another. Using three public benchmarks -- CEDAR, ICDAR, and GPDS Synthetic -- two experimental pipelines were developed: one based on raw signature images and another employing a preprocessing method referred to as shell preprocessing. Several behavioral patterns were identified and analyzed; however, no definitive superiority between the two approaches was established. The results show that the raw-image model achieved higher performance across benchmarks, while the shell-based model demonstrated promising potential for future refinement toward robust, cross-domain signature verification.
comment: Undergraduate thesis (preprint)---submitted to Escola Polit\'ecnica, Universidade Federal do Rio de Janeiro (POLI/UFRJ). The final version will include official signatures and defense approval
☆ MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.
comment: Project Website: https://github.com/NJU-LINK/MT-Video-Bench
☆ Raindrop GS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions
3D Gaussian Splatting (3DGS) under raindrop conditions suffers from severe occlusions and optical distortions caused by raindrop contamination on the camera lens, substantially degrading reconstruction quality. Existing benchmarks typically evaluate 3DGS using synthetic raindrop images with known camera poses (constrained images), assuming ideal conditions. However, in real-world scenarios, raindrops often interfere with accurate camera pose estimation and point cloud initialization. Moreover, a significant domain gap between synthetic and real raindrops further impairs generalization. To tackle these issues, we introduce RaindropGS, a comprehensive benchmark designed to evaluate the full 3DGS pipeline-from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions. Specifically, the whole benchmark pipeline consists of three parts: data preparation, data processing, and raindrop-aware 3DGS evaluation, including types of raindrop interference, camera pose estimation and point cloud initialization, single image rain removal comparison, and 3D Gaussian training comparison. First, we collect a real-world raindrop reconstruction dataset, in which each scene contains three aligned image sets: raindrop-focused, background-focused, and rain-free ground truth, enabling a comprehensive evaluation of reconstruction quality under different focus conditions. Through comprehensive experiments and analyses, we reveal critical insights into the performance limitations of existing 3DGS methods on unconstrained raindrop images and the varying impact of different pipeline components: the impact of camera focus position on 3DGS reconstruction performance, and the interference caused by inaccurate pose and point cloud initialization on reconstruction. These insights establish clear directions for developing more robust 3DGS methods under raindrop conditions.
☆ Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging
Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells(WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once(YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs), Vision Transformers (ViT). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.
☆ Improving Cross-Patient Generalization in Parkinson's Disease Detection through Chunk-Based Analysis of Hand-Drawn Patterns
Parkinson's disease (PD) is a neurodegenerative disease affecting about 1% of people over the age of 60, causing motor impairments that impede hand coordination activities such as writing and drawing. Many approaches have tried to support early detection of Parkinson's disease based on hand-drawn images; however, we identified two major limitations in the related works: (1) the lack of sufficient datasets, (2) the robustness when dealing with unseen patient data. In this paper, we propose a new approach to detect Parkinson's disease that consists of two stages: The first stage classifies based on their drawing type(circle, meander, spiral), and the second stage extracts the required features from the images and detects Parkinson's disease. We overcame the previous two limitations by applying a chunking strategy where we divide each image into 2x2 chunks. Each chunk is processed separately when extracting features and recognizing Parkinson's disease indicators. To make the final classification, an ensemble method is used to merge the decisions made from each chunk. Our evaluation shows that our proposed approach outperforms the top performing state-of-the-art approaches, in particular on unseen patients. On the NewHandPD dataset our approach, it achieved 97.08% accuracy for seen patients and 94.91% for unseen patients, our proposed approach maintained a gap of only 2.17 percentage points, compared to the 4.76-point drop observed in prior work.
comment: 19 pages, 2 figures, 9 tables
☆ Elastic ViTs from Pretrained Models without Retraining NeurIPS 2025
Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/
comment: Accepted at NeurIPS 2025
☆ GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at https://github.com/3145tttt/GAS.
☆ Towards 3D Objectness Learning in an Open World NeurIPS 2025
Recent advancements in 3D object detection and novel category detection have made significant progress, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, We propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector to detect any objects within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.
comment: Accepted by NeurIPS 2025
☆ Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are presented in https://github.com/Flame-Chasers/Bi-IRRA.
comment: Final version published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Xplore link: https://ieeexplore.ieee.org/document/11199360
☆ Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model
Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity. We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.
☆ PICABench: How Far Are We from Physically Realistic Image Editing?
Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
☆ 4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
☆ Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
comment: 7 pages, 1 figure, FLTA 2025
☆ ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
comment: 14 pages, 6 figures, 2 tables. Primary subject: cs.LG (Machine Learning) Cross-listed to: cs.CV (Computer Vision and Pattern Recognition), eess.IV (Image and Video Processing). Code available at: https://github.com/Bluesman79/ZACH-ViT Installation: pip install zachvit Paper licensed under CC BY-NC-ND 4.0. Code released under Apache 2.0 License
☆ Self-supervised Pre-training for Mapping of Archaeological Stone Wall in Historic Landscapes Using High-Resolution DEM Derivatives
Dry-stone walls hold significant heritage and environmental value. Mapping these structures is essential for ecosystem preservation and wildfire management in Australia. Yet, many walls remain unidentified due to their inaccessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable solution, but two major challenges persist: (1) visual occlusion of low-lying walls by dense vegetation, and (2) limited labeled data for supervised training. We propose DINO-CV, a segmentation framework for automatic mapping of low-lying dry-stone walls using high-resolution Airborne LiDAR-derived digital elevation models (DEMs). DEMs overcome visual occlusion by capturing terrain structures hidden beneath vegetation, enabling analysis of structural rather than spectral cues. DINO-CV introduces a self-supervised cross-view pre-training strategy based on knowledge distillation to mitigate data scarcity. It learns invariant visual and geometric representations across multiple DEM derivatives, supporting various vision backbones including ResNet, Wide ResNet, and Vision Transformers. Applied to the UNESCO World Heritage cultural landscape of Budj Bim, Victoria, the method identifies one of Australia's densest collections of colonial dry-stone walls beyond Indigenous heritage contexts. DINO-CV achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data. These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for automated dry-stone wall mapping in vegetated and heritage-rich environments with scarce annotations.
☆ CaMiT: A Time-Aware Car Model Dataset for Classification and Generation NeurIPS 2025
AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.
comment: To be published in NeurIPS 2025 Track on Datasets and Benchmarks
☆ ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input
Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants' ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: https://review-anon-io.github.io/ImaGGen.github.io/
☆ One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection CVPR2025
Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the "less is more" philosophy, we demonstrate that the orchestration of five simple element achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2's full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
comment: Extended version of CVPR2025
☆ Integrating BIM and UAV-based photogrammetry for Automated 3D Structure Model Segmentation
The advancement of UAV technology has enabled efficient, non-contact structural health monitoring. Combined with photogrammetry, UAVs can capture high-resolution scans and reconstruct detailed 3D models of infrastructure. However, a key challenge remains in segmenting specific structural components from these models-a process traditionally reliant on time-consuming and error-prone manual labeling. To address this issue, we propose a machine learning-based framework for automated segmentation of 3D point clouds. Our approach uses the complementary strengths of real-world UAV-scanned point clouds and synthetic data generated from Building Information Modeling (BIM) to overcome the limitations associated with manual labeling. Validation on a railroad track dataset demonstrated high accuracy in identifying and segmenting major components such as rails and crossties. Moreover, by using smaller-scale datasets supplemented with BIM data, the framework significantly reduced training time while maintaining reasonable segmentation accuracy. This automated approach improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.
☆ ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling NeurIPS 2025
3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation. At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.
comment: NeurIPS 2025 Poster
☆ Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation
This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.
☆ MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.
comment: 16 pages, 3 tables, 1 figure
☆ Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset
With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model's limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model's ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches.
☆ PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction -- all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask -- suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
☆ WP-CrackNet: A Collaborative Adversarial Learning Framework for End-to-End Weakly-Supervised Road Crack Detection
Road crack detection is essential for intelligent infrastructure maintenance in smart cities. To reduce reliance on costly pixel-level annotations, we propose WP-CrackNet, an end-to-end weakly-supervised method that trains with only image-level labels for pixel-wise crack detection. WP-CrackNet integrates three components: a classifier generating class activation maps (CAMs), a reconstructor measuring feature inferability, and a detector producing pixel-wise road crack detection results. During training, the classifier and reconstructor alternate in adversarial learning to encourage crack CAMs to cover complete crack regions, while the detector learns from pseudo labels derived from post-processed crack CAMs. This mutual feedback among the three components improves learning stability and detection accuracy. To further boost detection performance, we design a path-aware attention module (PAAM) that fuses high-level semantics from the classifier with low-level structural cues from the reconstructor by modeling spatial and channel-wise dependencies. Additionally, a center-enhanced CAM consistency module (CECCM) is proposed to refine crack CAMs using center Gaussian weighting and consistency constraints, enabling better pseudo-label generation. We create three image-level datasets and extensive experiments show that WP-CrackNet achieves comparable results to supervised methods and outperforms existing weakly-supervised methods, significantly advancing scalable road inspection. The source code package and datasets are available at https://mias.group/WP-CrackNet/.
☆ Detecting streaks in smart telescopes images with Deep Learning
The growing negative impact of the visibility of satellites in the night sky is influencing the practice of astronomy and astrophotograph, both at the amateur and professional levels. The presence of these satellites has the effect of introducing streaks into the images captured during astronomical observation, requiring the application of additional post processing to mitigate the undesirable impact, whether for data loss or cosmetic reasons. In this paper, we show how we test and adapt various Deep Learning approaches to detect streaks in raw astronomical data captured between March 2022 and February 2023 with smart telescopes.
comment: 19 pages, preprint submitted to the Springer CCIS Special Issue on DATA 2024 (currently under editorial processing)
☆ MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation
Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delination. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.
☆ MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
In recent years, large-scale generative models for visual content (\textit{e.g.,} images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling, details are available in \href{https://github.com/Shopee-MUG/MUG-V}{our webpage}.
comment: Technical Report; Project Page: \href{https://github.com/Shopee-MUG/MUG-V}
☆ Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital. Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues. More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are aggregated into structured, dataset-adaptive scoring rubrics guiding interpretable scene evaluation. During inference, first and last segments are scored based solely on their descriptions, whereas intermediate ones incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy. This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. On SumMe and TVSum, our method achieves F1 scores of \textbf{57.58} and \textbf{63.05}, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance. The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.
☆ Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment
Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.
☆ SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their ``in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios.In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, We specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at https://github.com/MSunDYY/SparseWorld.
comment: Under Review
☆ Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS
Sparse-view 3D Gaussian Splatting (3DGS) often overfits to the training views, leading to artifacts like blurring in novel view rendering. Prior work addresses it either by enhancing the initialization (\emph{i.e.}, the point cloud from Structure-from-Motion (SfM)) or by adding training-time constraints (regularization) to the 3DGS optimization. Yet our controlled ablations reveal that initialization is the decisive factor: it determines the attainable performance band in sparse-view 3DGS, while training-time constraints yield only modest within-band improvements at extra cost. Given initialization's primacy, we focus our design there. Although SfM performs poorly under sparse views due to its reliance on feature matching, it still provides reliable seed points. Thus, building on SfM, our effort aims to supplement the regions it fails to cover as comprehensively as possible. Specifically, we design: (i) frequency-aware SfM that improves low-texture coverage via low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization that lifts photometric supervision into additional points, compensating SfM-sparse regions with learned Gaussian centers; and (iii) point-cloud regularization that enforces multi-view consistency and uniform spatial coverage through simple geometric/visibility priors, yielding a clean and reliable point cloud. Our experiments on LLFF and Mip-NeRF360 demonstrate consistent gains in sparse-view settings, establishing our approach as a stronger initialization strategy. Code is available at https://github.com/zss171999645/ItG-GS.
comment: A preprint paper
☆ Rethinking Nighttime Image Deraining via Learnable Color Space Transformation NeurIPS 2025
Compared to daytime image deraining, nighttime image deraining poses significant challenges due to inherent complexities of nighttime scenarios and the lack of high-quality datasets that accurately represent the coupling effect between rain and illumination. In this paper, we rethink the task of nighttime image deraining and contribute a new high-quality benchmark, HQ-NightRain, which offers higher harmony and realism compared to existing datasets. In addition, we develop an effective Color Space Transformation Network (CST-Net) for better removing complex rain from nighttime scenes. Specifically, we propose a learnable color space converter (CSC) to better facilitate rain removal in the Y channel, as nighttime rain is more pronounced in the Y channel compared to the RGB color space. To capture illumination information for guiding nighttime deraining, implicit illumination guidance is introduced enabling the learned features to improve the model's robustness in complex scenarios. Extensive experiments show the value of our dataset and the effectiveness of our method. The source code and datasets are available at https://github.com/guanqiyuan/CST-Net.
comment: Accepted by NeurIPS 2025
☆ From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth, or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.
comment: Project page: https://falcon-vla.github.io/
☆ Leveraging AV1 motion vectors for Fast and Dense Feature Matching
We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53,px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
comment: Accepted ICIR 2025, camera-ready version
☆ DeepDetect: Learning All-in-One Dense Keypoints
Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model: ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints, that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
comment: 6 pages, 6 figures, 2 tables, 7 equations
☆ Monitoring Horses in Stalls: From Object to Event Detection
Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera's blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.
comment: 12 pages, 4 figures, 4 tables
☆ MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning
The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.
comment: Accepted and presented at the 2025 International Joint Conference on Neural Networks (IJCNN'25). The paper was awarded an honorable mention (best 4 papers)
☆ Closed-Loop Transfer for Weakly-supervised Affordance Grounding ICCV 2025
Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.
comment: Accepted at ICCV 2025
☆ Latent Spaces Beyond Synthesis: From GANs to Diffusion Models
This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi's account of synthesis as the amalgamation of distributed representations, we propose a distinction between "synthesis in a strict sense", where a compact latent space wholly determines the generative process, and "synthesis in a broad sense," which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.
comment: Presented and published at Ethics and Aesthetics of Artificial Intelligence Conference (EA-AI'25)
☆ Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing
Parkinson's disease (PD) severity diagnosis is crucial for early detecting potential patients and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients' "masked face" symptom and gains growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on single type of expression which can lead to misdiagnosis, and ignore the class imbalance across different PD stages which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.
comment: 3 pages, 2 figures, accepted by MIND 2025
☆ Beyond Real Faces: Synthetic Datasets Can Achieve Reliable Recognition Performance without Privacy Compromise
The deployment of facial recognition systems has created an ethical dilemma: achieving high accuracy requires massive datasets of real faces collected without consent, leading to dataset retractions and potential legal liabilities under regulations like GDPR. While synthetic facial data presents a promising privacy-preserving alternative, the field lacks comprehensive empirical evidence of its viability. This study addresses this critical gap through extensive evaluation of synthetic facial recognition datasets. We present a systematic literature review identifying 25 synthetic facial recognition datasets (2018-2025), combined with rigorous experimental validation. Our methodology examines seven key requirements for privacy-preserving synthetic data: identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. Through experiments involving over 10 million synthetic samples, extended by a comparison of results reported on five standard benchmarks, we provide the first comprehensive empirical assessment of synthetic data's capability to replace real datasets. Best-performing synthetic datasets (VariFace, VIGFace) achieve recognition accuracies of 95.67% and 94.91% respectively, surpassing established real datasets including CASIA-WebFace (94.70%). While those images remain private, publicly available alternatives Vec2Face (93.52%) and CemiFace (93.22%) come close behind. Our findings reveal that they ensure proper intra-class variability while maintaining identity separability. Demographic bias analysis shows that, even though synthetic data inherits limited biases, it offers unprecedented control for bias mitigation through generation parameters. These results establish synthetic facial data as a scientifically viable and ethically imperative alternative for facial recognition research.
☆ Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs NeurIPS 2025
Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
comment: NeurIPS 2025
☆ M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
comment: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures
☆ Exploring The Missing Semantics In Event Modality
Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic feature, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.
☆ Nearest-Class Mean and Logits Agreement for Wildlife Open-Set Recognition
Current state-of-the-art Wildlife classification models are trained under the closed world setting. When exposed to unknown classes, they remain overconfident in their predictions. Open-set Recognition (OSR) aims to classify known classes while rejecting unknown samples. Several OSR methods have been proposed to model the closed-set distribution by observing the feature, logit, or softmax probability space. A significant drawback of many existing approaches is the requirement to retrain the pre-trained classification model with the OSR-specific strategy. This study contributes a post-processing OSR method that measures the agreement between the models' features and predicted logits. We propose a probability distribution based on an input's distance to its Nearest Class Mean (NCM). The NCM-based distribution is then compared with the softmax probabilities from the logit space to measure agreement between the NCM and the classification head. Our proposed strategy ranks within the top three on two evaluated datasets, showing consistent performance across the two datasets. In contrast, current state-of-the-art methods excel on a single dataset. We achieve an AUROC of 93.41 and 95.35 for African and Swedish animals. The code can be found https://github.com/Applied-Representation-Learning-Lab/OSR.
☆ iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA ICCV 2025
Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX-a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.
comment: Accepted to ICCV 2025 Workshop
☆ CharDiff: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration
The significance of license plate image restoration goes beyond the preprocessing stage of License Plate Recognition (LPR) systems, as it also serves various purposes, including increasing evidential value, enhancing the clarity of visual interface, and facilitating further utilization of license plate images. We propose a novel diffusion-based framework with character-level guidance, CharDiff, which effectively restores and recognizes severely degraded license plate images captured under realistic conditions. CharDiff leverages fine-grained character-level priors extracted through external segmentation and Optical Character Recognition (OCR) modules tailored for low-quality license plate images. For precise and focused guidance, CharDiff incorporates a novel Character-guided Attention through Region-wise Masking (CHARM) module, which ensures that each character's guidance is restricted to its own region, thereby avoiding interference with other regions. In experiments, CharDiff significantly outperformed the baseline restoration models in both restoration quality and recognition accuracy, achieving a 28% relative reduction in CER on the Roboflow-LP dataset, compared to the best-performing baseline model. These results indicate that the structured character-guided conditioning effectively enhances the robustness of diffusion-based license plate restoration and recognition in practical deployment scenarios.
comment: 11 pages, 6 figures
☆ A Single Set of Adversarial Clothes Breaks Multiple Defense Methods in the Physical World
In recent years, adversarial attacks against deep learning-based object detectors in the physical world have attracted much attention. To defend against these attacks, researchers have proposed various defense methods against adversarial patches, a typical form of physically-realizable attack. However, our experiments showed that simply enlarging the patch size could make these defense methods fail. Motivated by this, we evaluated various defense methods against adversarial clothes which have large coverage over the human body. Adversarial clothes provide a good test case for adversarial defense against patch-based attacks because they not only have large sizes but also look more natural than a large patch on humans. Experiments show that all the defense methods had poor performance against adversarial clothes in both the digital world and the physical world. In addition, we crafted a single set of clothes that broke multiple defense methods on Faster R-CNN. The set achieved an Attack Success Rate (ASR) of 96.06% against the undefended detector and over 64.84% ASRs against nine defended models in the physical world, unveiling the common vulnerability of existing adversarial defense methods against adversarial clothes. Code is available at: https://github.com/weiz0823/adv-clothes-break-multiple-defenses.
comment: 13 pages, 8 figures
☆ CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference
We introduce CausalMamba, a scalable framework that addresses fundamental limitations in fMRI-based causal inference: the ill-posed nature of inferring neural causality from hemodynamically distorted BOLD signals and the computational intractability of existing methods like Dynamic Causal Modeling (DCM). Our approach decomposes this complex inverse problem into two tractable stages: BOLD deconvolution to recover latent neural activity, followed by causal graph inference using a novel Conditional Mamba architecture. On simulated data, CausalMamba achieves 37% higher accuracy than DCM. Critically, when applied to real task fMRI data, our method recovers well-established neural pathways with 88% fidelity, whereas conventional approaches fail to identify these canonical circuits in over 99% of subjects. Furthermore, our network analysis of working memory data reveals that the brain strategically shifts its primary causal hub-recruiting executive or salience networks depending on the stimulus-a sophisticated reconfiguration that remains undetected by traditional methods. This work provides neuroscientists with a practical tool for large-scale causal inference that captures both fundamental circuit motifs and flexible network dynamics underlying cognitive function.
☆ LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at https://anonymous.4open.science/r/LongInsightBench-910F/.
comment: Submitted to ARR Rolling Review
☆ Exploring Structural Degradation in Dense Representations for Self-supervised Learning NeurIPS 2025
In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by $3.0\%$ on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at https://github.com/EldercatSAM/SSL-Degradation.
comment: Accepted by NeurIPS 2025
☆ Machine Vision-Based Surgical Lighting System:Design and Implementation
Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.
☆ SG-CLDFF: A Novel Framework for Automated White Blood Cell Classification and Segmentation
Accurate segmentation and classification of white blood cells (WBCs) in microscopic images are essential for diagnosis and monitoring of many hematological disorders, yet remain challenging due to staining variability, complex backgrounds, and class imbalance. In this paper, we introduce a novel Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF) that tightly integrates saliency-driven preprocessing with multi-scale deep feature aggregation to improve both robustness and interpretability for WBC analysis. SG-CLDFF first computes saliency priors to highlight candidate WBC regions and guide subsequent feature extraction. A lightweight hybrid backbone (EfficientSwin-style) produces multi-resolution representations, which are fused by a ResNeXt-CC-inspired cross-layer fusion module to preserve complementary information from shallow and deep layers. The network is trained in a multi-task setup with concurrent segmentation and cell-type classification heads, using class-aware weighted losses and saliency-alignment regularization to mitigate imbalance and suppress background activation. Interpretability is enforced through Grad-CAM visualizations and saliency consistency checks, allowing model decisions to be inspected at the regional level. We validate the framework on standard public benchmarks (BCCD, LISC, ALL-IDB), reporting consistent gains in IoU, F1, and classification accuracy compared to strong CNN and transformer baselines. An ablation study also demonstrates the individual contributions of saliency preprocessing and cross-layer fusion. SG-CLDFF offers a practical and explainable path toward more reliable automated WBC analysis in clinical workflows.
☆ Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models
Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning -- making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.
comment: In proceedings of IROS 2025
☆ FineVision: Open Data Is All You Need
The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
☆ Fair and Interpretable Deepfake Detection in Videos
Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.
comment: 10 pages (including References)
☆ From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models
Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
☆ Taming Modality Entanglement in Continual Audio-Visual Segmentation
Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding objects is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequent co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, allowing for the increase of rehearsal sample frequency of those confusable classes during training process. Moreover, we construct three audio-visual incremental scenarios to verify effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.
☆ When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions NeurIPS 2025
Existing Moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality datasets called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
comment: Accepted to NeurIPS 2025
☆ $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.
comment: EMNLP 2025 Main
☆ Optimizing DINOv2 with Registers for Face Anti-Spoofing ICCV 2025
Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by ``The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025'' and SiW dataset.
comment: ICCV 2025 Workshop FAS
☆ EndoCIL: A Class-Incremental Learning Framework for Endoscopic Image Classification
Class-incremental learning (CIL) for endoscopic image analysis is crucial for real-world clinical applications, where diagnostic models should continuously adapt to evolving clinical data while retaining performance on previously learned ones. However, existing replay-based CIL methods fail to effectively mitigate catastrophic forgetting due to severe domain discrepancies and class imbalance inherent in endoscopic imaging. To tackle these challenges, we propose EndoCIL, a novel and unified CIL framework specifically tailored for endoscopic image diagnosis. EndoCIL incorporates three key components: Maximum Mean Discrepancy Based Replay (MDBR), employing a distribution-aligned greedy strategy to select diverse and representative exemplars, Prior Regularized Class Balanced Loss (PRCBL), designed to alleviate both inter-phase and intra-phase class imbalance by integrating prior class distributions and balance weights into the loss function, and Calibration of Fully-Connected Gradients (CFG), which adjusts the classifier gradients to mitigate bias toward new classes. Extensive experiments conducted on four public endoscopic datasets demonstrate that EndoCIL generally outperforms state-of-the-art CIL methods across varying buffer sizes and evaluation metrics. The proposed framework effectively balances stability and plasticity in lifelong endoscopic diagnosis, showing promising potential for clinical scalability and deployment.
☆ Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis
Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.
comment: Accepted to IEEE 2025 Conference on Games
☆ From Pixels to People: Satellite-Based Mapping and Quantification of Riverbank Erosion and Lost Villages in Bangladesh
The great rivers of Bangladesh, arteries of commerce and sustenance, are also agents of relentless destruction. Each year, they swallow whole villages and vast tracts of farmland, erasing communities from the map and displacing thousands of families. To track this slow-motion catastrophe has, until now, been a Herculean task for human analysts. Here we show how a powerful general-purpose vision model, the Segment Anything Model (SAM), can be adapted to this task with remarkable precision. To do this, we assembled a new dataset - a digital chronicle of loss compiled from historical Google Earth imagery of Bangladesh's most vulnerable regions, including Mokterer Char Union, Kedarpur Union, Balchipara village, and Chowhali Upazila, from 2003 to 2025. Crucially, this dataset is the first to include manually annotated data on the settlements that have vanished beneath the water. Our method first uses a simple color-channel analysis to provide a rough segmentation of land and water, and then fine-tunes SAM's mask decoder to recognize the subtle signatures of riverbank erosion. The resulting model demonstrates a keen eye for this destructive process, achieving a mean Intersection over Union of 86.30% and a Dice score of 92.60% - a performance that significantly surpasses traditional methods and off-the-shelf deep learning models. This work delivers three key contributions: the first annotated dataset of disappeared settlements in Bangladesh due to river erosion; a specialized AI model fine-tuned for this critical task; and a method for quantifying land loss with compelling visual evidence. Together, these tools provide a powerful new lens through which policymakers and disaster management agencies can monitor erosion, anticipate its trajectory, and ultimately protect the vulnerable communities in its path.
comment: Submitted to the International Conference on Data and Applied Analytics (IDAA 2025). 15 pages, 5 figures, 4 tables
☆ ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90\% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.
☆ HIDISC: A Hyperbolic Framework for Domain Generalization with Generalized Category Discovery NeurIPS
Generalized Category Discovery (GCD) aims to classify test-time samples into either seen categories** -- available during training -- or novel ones, without relying on label supervision. Most existing GCD methods assume simultaneous access to labeled and unlabeled data during training and arising from the same domain, limiting applicability in open-world scenarios involving distribution shifts. Domain Generalization with GCD (DG-GCD) lifts this constraint by requiring models to generalize to unseen domains containing novel categories, without accessing targetdomain data during training. The only prior DG-GCD method, DG2CD-Net, relies on episodic training with multiple synthetic domains and task vector aggregation, incurring high computational cost and error accumulation. We propose HIDISC, a hyperbolic representation learning framework that achieves domain and category-level generalization without episodic simulation. To expose the model to minimal but diverse domain variations, we augment the source domain using GPT-guided diffusion, avoiding overfitting while maintaining efficiency. To structure the representation space, we introduce Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space, preserving manifold consistency. A unified loss -- combining penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion -- **facilitates compact, semantically structured embeddings. A learnable curvature parameter further adapts the geometry to dataset complexity. HIDISC achieves state-of-the-art results on PACS , Office-Home , and DomainNet, consistently outperforming the existing Euclidean and hyperbolic (DG)-GCD baselines.
comment: Accpeted at NeurIPS (2025) Main Conference
☆ Capturing Head Avatar with Hand Contacts from a Monocular Video ICCV 2025
Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions. We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.
comment: ICCV 2025
☆ Benchmarking Out-of-Distribution Detection for Plankton Recognition: A Systematic Evaluation of Advanced Methods in Marine Ecological Monitoring
Automated plankton recognition models face significant challenges during real-world deployment due to distribution shifts (Out-of-Distribution, OoD) between training and test data. This stems from plankton's complex morphologies, vast species diversity, and the continuous discovery of novel species, which leads to unpredictable errors during inference. Despite rapid advancements in OoD detection methods in recent years, the field of plankton recognition still lacks a systematic integration of the latest computer vision developments and a unified benchmark for large-scale evaluation. To address this, this paper meticulously designed a series of OoD benchmarks simulating various distribution shift scenarios based on the DYB-PlanktonNet dataset \cite{875n-f104-21}, and systematically evaluated twenty-two OoD detection methods. Extensive experimental results demonstrate that the ViM \cite{wang2022vim} method significantly outperforms other approaches in our constructed benchmarks, particularly excelling in Far-OoD scenarios with substantial improvements in key metrics. This comprehensive evaluation not only provides a reliable reference for algorithm selection in automated plankton recognition but also lays a solid foundation for future research in plankton OoD detection. To our knowledge, this study marks the first large-scale, systematic evaluation and analysis of Out-of-Distribution data detection methods in plankton recognition. Code is available at https://github.com/BlackJack0083/PlanktonOoD.
☆ Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models for the ability of parallel generation, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our codes will be released in https://github.com/feihongyan1/GtR.
comment: 12 pages, 6 figures
☆ Investigating Adversarial Robustness against Preprocessing used in Blackbox Face Recognition
Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, exposing blind spots in these systems, as well as protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study seeks to investigate the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas choice of interpolation method during downsampling has relatively minimal impacts. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.
comment: Accepted for publication in DICTA 2025
☆ GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image
Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.
☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
☆ KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation
Articulated objects, such as laptops and drawers, exhibit significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and pose estimation from single view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we produce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.
☆ GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance based on the gradient of log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
comment: 28 pages, 16 figures, conference
☆ Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation
We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability an d random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random metric with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricidal free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.
☆ Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras
This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source's spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system's sensitivity to visible spectra, modern consumer camera sensors' spectral sensitivity, and narrowband LEDs' ability to generate broadband spectra perceived as "white light" (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest-embedding 128 bits within a 10-second video clip-this capacity is sufficient for essential metadata supporting privacy protection and content verification.
☆ Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement
Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.
☆ Shape-aware Inertial Poser: Motion Tracking for Humans with Diverse Shapes Using Sparse Inertial Sensors SIGGRAPH
Human motion capture with sparse inertial sensors has gained significant attention recently. However, existing methods almost exclusively rely on a template adult body shape to model the training data, which poses challenges when generalizing to individuals with largely different body shapes (such as a child). This is primarily due to the variation in IMU-measured acceleration caused by changes in body shape. To fill this gap, we propose Shape-aware Inertial Poser (SAIP), the first solution considering body shape differences in sparse inertial-based motion capture. Specifically, we decompose the sensor measurements related to shape and pose in order to effectively model their joint correlations. Firstly, we train a regression model to transfer the IMU-measured accelerations of a real body to match the template adult body model, compensating for the shape-related sensor measurements. Then, we can easily follow the state-of-the-art methods to estimate the full body motions of the template-shaped body. Finally, we utilize a second regression model to map the joint velocities back to the real body, combined with a shape-aware physical optimization strategy to calculate global motions on the subject. Furthermore, our method relies on body shape awareness, introducing the first inertial shape estimation scheme. This is accomplished by modeling the shape-conditioned IMU-pose correlation using an MLP-based network. To validate the effectiveness of SAIP, we also present the first IMU motion capture dataset containing individuals of different body sizes. This dataset features 10 children and 10 adults, with heights ranging from 110 cm to 190 cm, and a total of 400 minutes of paired IMU-Motion samples. Extensive experimental results demonstrate that SAIP can effectively handle motion capture tasks for diverse body shapes. The code and dataset are available at https://github.com/yinlu5942/SAIP.
comment: Accepted by SIGGRAPH Asia 2025 (TOG)
☆ GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation
Planes are fundamental primitives of 3D sences, especially in man-made environments such as indoor spaces and urban streets. Representing these planes in a structured and parameterized format facilitates scene editing and physical simulations in downstream applications. Recently, Gaussian Splatting (GS) has demonstrated remarkable effectiveness in the Novel View Synthesis task, with extensions showing great potential in accurate surface reconstruction. However, even state-of-the-art GS representations often struggle to reconstruct planar regions with sufficient smoothness and precision. To address this issue, we propose GSPlane, which recovers accurate geometry and produces clean and well-structured mesh connectivity for plane regions in the reconstructed scene. By leveraging off-the-shelf segmentation and normal prediction models, GSPlane extracts robust planar priors to establish structured representations for planar Gaussian coordinates, which help guide the training process by enforcing geometric consistency. To further enhance training robustness, a Dynamic Gaussian Re-classifier is introduced to adaptively reclassify planar Gaussians with persistently high gradients as non-planar, ensuring more reliable optimization. Furthermore, we utilize the optimized planar priors to refine the mesh layouts, significantly improving topological structure while reducing the number of vertices and faces. We also explore applications of the structured planar representation, which enable decoupling and flexible manipulation of objects on supportive planes. Extensive experiments demonstrate that, with no sacrifice in rendering quality, the introduction of planar priors significantly improves the geometric accuracy of the extracted meshes across various baselines.
☆ Towards a Generalizable Fusion Architecture for Multimodal Object Detection ICCV 2025
Multimodal object detection improves robustness in chal- lenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi- Modal Cross Attention Fusion (FMCAF), a preprocess- ing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency- domain filtering block (Freq-Filter) to suppress redun- dant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset- specific tuning. On LLVIP (low-light pedestrian detec- tion) and VEDAI (aerial vehicle detection), FMCAF outper- forms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.
comment: 8 pages, 8 figures, accepted at ICCV 2025 MIRA Workshop
☆ ProDAT: Progressive Density-Aware Tail-Drop for Point Cloud Coding
Three-dimensional (3D) point clouds are becoming increasingly vital in applications such as autonomous driving, augmented reality, and immersive communication, demanding real-time processing and low latency. However, their large data volumes and bandwidth constraints hinder the deployment of high-quality services in resource-limited environments. Progres- sive coding, which allows for decoding at varying levels of detail, provides an alternative by allowing initial partial decoding with subsequent refinement. Although recent learning-based point cloud geometry coding methods have achieved notable success, their fixed latent representation does not support progressive decoding. To bridge this gap, we propose ProDAT, a novel density-aware tail-drop mechanism for progressive point cloud coding. By leveraging density information as a guidance signal, latent features and coordinates are decoded adaptively based on their significance, therefore achieving progressive decoding at multiple bitrates using one single model. Experimental results on benchmark datasets show that the proposed ProDAT not only enables progressive coding but also achieves superior coding efficiency compared to state-of-the-art learning-based coding techniques, with over 28.6% BD-rate improvement for PSNR- D2 on SemanticKITTI and over 18.15% for ShapeNet
☆ Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
comment: 7 pages, 1 figure, FLTA 2025
☆ Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset ICCV
Accurate 3D reconstruction of lunar surfaces is essential for space exploration. However, existing stereo vision reconstruction methods struggle in this context due to the Moon's lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models, trained on human-scale datasets, have rarely been tested on planetary imagery and cannot be transferred directly to lunar conditions. To address this issue, we introduce LunarStereo, the first open dataset of photorealistic stereo image pairs of the Moon, simulated using ray tracing based on high-resolution topography and reflectance models. It covers diverse altitudes, lighting conditions, and viewing angles around the lunar South Pole, offering physically grounded supervision for 3D reconstruction tasks. Based on this dataset, we adapt the MASt3R model to the lunar domain through fine-tuning on LunarStereo. We validate our approach through extensive qualitative and quantitative experiments on both synthetic and real lunar data, evaluating 3D surface reconstruction and relative pose estimation. Extensive experiments on synthetic and real lunar data validate the approach, demonstrating significant improvements over zero-shot baselines and paving the way for robust cross-scale generalization in extraterrestrial environments.
comment: Accepted to ICCV workshop 2025. The project page can be accessed via this https://clementinegrethen.github.io/publications/3D-Vast-ICCV2025.html URL. The source code is available at this https://github.com/clementinegrethen/StereoLunar URL
☆ World-in-World: World Models in a Closed-Loop World
Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
comment: Code is at https://github.com/World-In-World/world-in-world
☆ SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving
Collaborative driving systems leverage vehicle-to-everything (V2X) communication across multiple agents to enhance driving safety and efficiency. Traditional V2X systems take raw sensor data, neural features, or perception results as communication media, which face persistent challenges, including high bandwidth demands, semantic loss, and interoperability issues. Recent advances investigate natural language as a promising medium, which can provide semantic richness, decision-level reasoning, and human-machine interoperability at significantly lower bandwidth. Despite great promise, this paradigm shift also introduces new vulnerabilities within language communication, including message loss, hallucinations, semantic manipulation, and adversarial attacks. In this work, we present the first systematic study of full-stack safety and security issues in natural-language-based collaborative driving. Specifically, we develop a comprehensive taxonomy of attack strategies, including connection disruption, relay/replay interference, content spoofing, and multi-connection forgery. To mitigate these risks, we introduce an agentic defense pipeline, which we call SafeCoop, that integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment. We systematically evaluate SafeCoop in closed-loop CARLA simulation across 32 critical scenarios, achieving 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection. This study provides guidance for advancing research on safe, secure, and trustworthy language-driven collaboration in transportation systems. Our project page is https://xiangbogaobarry.github.io/SafeCoop.
☆ Online In-Context Distillation for Low-Resource Vision Language Models
As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
☆ From Volume Rendering to 3D Gaussian Splatting: Theory and Applications
The problem of 3D reconstruction from posed images is undergoing a fundamental transformation, driven by continuous advances in 3D Gaussian Splatting (3DGS). By modeling scenes explicitly as collections of 3D Gaussians, 3DGS enables efficient rasterization through volumetric splatting, offering thus a seamless integration with common graphics pipelines. Despite its real-time rendering capabilities for novel view synthesis, 3DGS suffers from a high memory footprint, the tendency to bake lighting effects directly into its representation, and limited support for secondary-ray effects. This tutorial provides a concise yet comprehensive overview of the 3DGS pipeline, starting from its splatting formulation and then exploring the main efforts in addressing its limitations. Finally, we survey a range of applications that leverage 3DGS for surface reconstruction, avatar modeling, animation, and content generation-highlighting its efficient rendering and suitability for feed-forward pipelines.
comment: Accepted at the Conference on Graphics, Patterns and Images (SIBGRAPI), math focused, 5 equations, 5 Figure, 5 pages of text and 1 of bibligraphy
☆ Accelerating Vision Transformers with Adaptive Patch Sizes
Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30\% faster training and inference in visual QA, object detection, and semantic segmentation.
comment: Project page at https://rccchoudhury.github.io/apt/
☆ Big Data, Tiny Targets: An Exploratory Study in Machine Learning-enhanced Detection of Microplastic from Filters
Microplastics (MPs) are ubiquitous pollutants with demonstrated potential to impact ecosystems and human health. Their microscopic size complicates detection, classification, and removal, especially in biological and environmental samples. While techniques like optical microscopy, Scanning Electron Microscopy (SEM), and Atomic Force Microscopy (AFM) provide a sound basis for detection, applying these approaches requires usually manual analysis and prevents efficient use in large screening studies. To this end, machine learning (ML) has emerged as a powerful tool in advancing microplastic detection. In this exploratory study, we investigate potential, limitations and future directions of advancing the detection and quantification of MP particles and fibres using a combination of SEM imaging and machine learning-based object detection. For simplicity, we focus on a filtration scenario where image backgrounds exhibit a symmetric and repetitive pattern. Our findings indicate differences in the quality of YOLO models for the given task and the relevance of optimizing preprocessing. At the same time, we identify open challenges, such as limited amounts of expert-labeled data necessary for reliable training of ML models.
☆ Chimera: Compositional Image Generation using Part-based Concepting
Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.
☆ HouseTour: A Virtual Real Estate A(I)gent ICCV 2025
We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
comment: Published on ICCV 2025
☆ TriggerNet: A Novel Explainable AI Framework for Red Palm Mite Detection and Multi-Model Comparison and Heuristic-Guided Annotation
The red palm mite infestation has become a serious concern, particularly in regions with extensive palm cultivation, leading to reduced productivity and economic losses. Accurate and early identification of mite-infested plants is critical for effective management. The current study focuses on evaluating and comparing the ML model for classifying the affected plants and detecting the infestation. TriggerNet is a novel interpretable AI framework that integrates Grad-CAM, RISE, FullGrad, and TCAV to generate novel visual explanations for deep learning models in plant classification and disease detection. This study applies TriggerNet to address red palm mite (Raoiella indica) infestation, a major threat to palm cultivation and agricultural productivity. A diverse set of RGB images across 11 plant species, Arecanut, Date Palm, Bird of Paradise, Coconut Palm, Ginger, Citrus Tree, Palm Oil, Orchid, Banana Palm, Avocado Tree, and Cast Iron Plant was utilized for training and evaluation. Advanced deep learning models like CNN, EfficientNet, MobileNet, ViT, ResNet50, and InceptionV3, alongside machine learning classifiers such as Random Forest, SVM, and KNN, were employed for plant classification. For disease classification, all plants were categorized into four classes: Healthy, Yellow Spots, Reddish Bronzing, and Silk Webbing. Snorkel was used to efficiently label these disease classes by leveraging heuristic rules and patterns, reducing manual annotation time and improving dataset reliability.
comment: 17 pages, 9 figures
☆ SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection
Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
comment: 8 pages, 5 figures
☆ ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues
Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43\% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available on https://github.com/prateek-gothwal/ViBED-Net .
comment: 10 pages, 4 figures, 2 tables
☆ ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy ICCV 2025
We present ManzaiSet, the first large scale multimodal dataset of viewer responses to Japanese manzai comedy, capturing facial videos and audio from 241 participants watching up to 10 professional performances in randomized order (94.6 percent watched >= 8; analyses focus on n=228). This addresses the Western centric bias in affective computing. Three key findings emerge: (1) k means clustering identified three distinct viewer types: High and Stable Appreciators (72.8 percent, n=166), Low and Variable Decliners (13.2 percent, n=30), and Variable Improvers (14.0 percent, n=32), with heterogeneity of variance (Brown Forsythe p < 0.001); (2) individual level analysis revealed a positive viewing order effect (mean slope = 0.488, t(227) = 5.42, p < 0.001, permutation p < 0.001), contradicting fatigue hypotheses; (3) automated humor classification (77 instances, 131 labels) plus viewer level response modeling found no type wise differences after FDR correction. The dataset enables culturally aware emotion AI development and personalized entertainment systems tailored to non Western contexts.
comment: ICCV 2025 Workshop on Affective & Behavior Analysis in-the-Wild (ABAW), Honolulu, HI, USA (Oct 19, 2025, HST). 11 pages, 5 figures
☆ Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods
Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.
☆ Demystifying Transition Matching: When and Why It Can Beat Flow Matching
Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
☆ Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success, or when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full manipulation pipeline, RoboBench defines five dimensions-instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis-spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator. It evaluate embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition, and guide the development of next-generation embodied MLLMs. The project page is in https://robo-bench.github.io.
♻ ☆ SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model MICCAI2025
X-ray imaging is a rapid and cost-effective tool for visualizing internal human anatomy. While multi-view X-ray imaging provides complementary information that enhances diagnosis, intervention, and education, acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address these challenges, we propose a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single view. Unlike prior methods, which are limited in angular range, resolution, and image quality, our approach leverages the Diffusion Transformer to preserve fine details and employs a weak-to-strong training strategy for stable high-resolution image generation. Experimental results demonstrate that our method generates higher-resolution outputs with improved control over viewing angles. This capability has significant implications not only for clinical applications but also for medical education and data extension, enabling the creation of diverse, high-quality datasets for training and analysis. Our code is available at https://github.com/xiechun298/SV-DRR.
comment: Accepted by MICCAI2025
♻ ☆ UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens
Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept $\langle bo\rangle$, generating "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of its hat. We call this kind of generation \textit{\textbf{personalized attribute-reasoning generation}}. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and attribute-reasoning generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding, concept generation, and achieving state-of-the-art results in personalized attribute-reasoning generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: \href{https://github.com/arctanxarc/UniCTokens}{https://github.com/arctanxarc/UniCTokens}.
♻ ☆ A Synthetic Data-Driven Radiology Foundation Model for Pan-tumor Clinical Diagnosis
AI-assisted imaging made substantial advances in tumor diagnosis and management. However, a major barrier to developing robust oncology foundation models is the scarcity of large-scale, high-quality annotated datasets, which are limited by privacy restrictions and the high cost of manual labeling. To address this gap, we present PASTA, a pan-tumor radiology foundation model built on PASTA-Gen, a synthetic data framework that generated 30,000 3D CT scans with pixel-level lesion masks and structured reports of tumors across ten organ systems. Leveraging this resource, PASTA achieves state-of-the-art performance on 45 of 46 oncology tasks, including non-contrast CT tumor screening, lesion segmentation, structured reporting, tumor staging, survival prediction, and MRI-modality transfer. To assess clinical applicability, we developed PASTA-AID, a clinical decision support system, and ran a retrospective simulated clinical trial across two scenarios. For pan-tumor screening on plain CT with fixed reading time, PASTA-AID increased radiologists' throughput by 11.1-25.1% and improved sensitivity by 17.0-31.4% and precision by 10.5-24.9%; additionally, in a diagnosis-aid workflow, it reduced segmentation time by up to 78.2% and reporting time by up to 36.5%. Beyond gains in accuracy and efficiency, PASTA-AID narrowed the expertise gap, enabling less-experienced radiologists to approach expert-level performance. Together, this work establishes an end-to-end, synthetic data-driven pipeline spanning data generation, model development, and clinical validation, thereby demonstrating substantial potential for pan-tumor research and clinical translation.
comment: 63 pages, 7 figures
♻ ☆ Creative synthesis of kinematic mechanisms
In this paper, we formulate the problem of kinematic synthesis for planar linkages as a cross-domain image generation task. We develop a planar linkages dataset using RGB image representations, covering a range of mechanisms: from simple types such as crank-rocker and crank-slider to more complex eight-bar linkages like Jansen's mechanism. A shared-latent variational autoencoder (VAE) is employed to explore the potential of image generative models for synthesizing unseen motion curves and simulating novel kinematics. By encoding the drawing speed of trajectory points as color gradients, the same architecture also supports kinematic synthesis conditioned on both trajectory shape and velocity profiles. We validate our method on three datasets of increasing complexity: a standard four-bar linkage set, a mixed set of four-bar and crank-slider mechanisms, and a complex set including multi-loop mechanisms. Preliminary results demonstrate the effectiveness of image-based representations for generative mechanical design, showing that mechanisms with revolute and prismatic joints, and potentially cams and gears, can be represented and synthesized within a unified image generation framework.
comment: 6pages, 6 figures
♻ ☆ NanoHTNet: Nano Human Topology Network for Efficient 3D Human Pose Estimation
The widespread application of 3D human pose estimation (HPE) is limited by resource-constrained edge devices, requiring more efficient models. A key approach to enhancing efficiency involves designing networks based on the structural characteristics of input data. However, effectively utilizing the structural priors in human skeletal inputs remains challenging. To address this, we leverage both explicit and implicit spatio-temporal priors of the human body through innovative model design and a pre-training proxy task. First, we propose a Nano Human Topology Network (NanoHTNet), a tiny 3D HPE network with stacked Hierarchical Mixers to capture explicit features. Specifically, the spatial Hierarchical Mixer efficiently learns the human physical topology across multiple semantic levels, while the temporal Hierarchical Mixer with discrete cosine transform and low-pass filtering captures local instantaneous movements and global action coherence. Moreover, Efficient Temporal-Spatial Tokenization (ETST) is introduced to enhance spatio-temporal interaction and reduce computational complexity significantly. Second, PoseCLR is proposed as a general pre-training method based on contrastive learning for 3D HPE, aimed at extracting implicit representations of human topology. By aligning 2D poses from diverse viewpoints in the proxy task, PoseCLR aids 3D HPE encoders like NanoHTNet in more effectively capturing the high-dimensional features of the human body, leading to further performance improvements. Extensive experiments verify that NanoHTNet with PoseCLR outperforms other state-of-the-art methods in efficiency, making it ideal for deployment on edge devices like the Jetson Nano. Code and models are available at https://github.com/vefalun/NanoHTNet.
comment: Accepted by TIP 2025, Open Sourced
♻ ☆ Solving Oscillator Ordinary Differential Equations in the Time Domain with High Performance via Soft-constrained Physics-informed Neural Network with Small Data
In many scientific and engineering (e.g., physical, biochemical, medical) practices, data generated through expensive experiments or large-scale simulations, are often sparse and noisy. Physics-informed neural network (PINN) incorporates physical information and knowledge into network topology or computational processes as model priors, with the unique advantage of achieving strong generalization with small data. This study aims to investigate the performance characteristics of the soft-constrained PINN method to solving typical linear and nonlinear ordinary differential equations (ODEs) such as primer, Van der Pol and Duffing oscillators, especially the effectiveness, efficiency, and robustness to noise with minimal data. It is verified that the soft-constrained PINN significantly reduces the need for labeled data. With the aid of appropriate collocation points no need to be labeled, it can predict and also extrapolate with minimal data. First-order and second-order ODEs, no matter linear or nonlinear oscillators, require only one and two training data (containing initial values) respectively, just like classical analytic or Runge-Kutta methods, and with equivalent precision and comparable efficiency (fast training in seconds for scalar ODEs). Furthermore, it can conveniently impose a physical law (e.g., conservation of energy) constraint by adding a regularization term to the total loss function, improving the performance to deal with various complexities such as nonlinearity like Duffing. The DeepXDE-based PINN implementation is light code and can be efficiently trained on both GPU and CPU platforms. The mathematical and computational framework of this alternative and feasible PINN method to ODEs, can be easily extended to PDEs, etc., and is becoming a favorable catalyst for the era of Digital Twins.
comment: 17 pages, 7 figures, 2 tables, etc
♻ ☆ SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution
High-spatial-resolution hyperspectral images (HSI) are essential for applications such as remote sensing and medical imaging, yet HSI sensors inherently trade spatial detail for spectral richness. Fusing high-spatial-resolution multispectral images (HR-MSI) with low-spatial-resolution hyperspectral images (LR-HSI) is a promising route to recover fine spatial structures without sacrificing spectral fidelity. Most state-of-the-art methods for HSI-MSI fusion demand point spread function (PSF) calibration or ground truth high resolution HSI (HR-HSI), both of which are impractical to obtain in real world settings. We present SpectraLift, a fully self-supervised framework that fuses LR-HSI and HR-MSI inputs using only the MSI's Spectral Response Function (SRF). SpectraLift trains a lightweight per-pixel multi-layer perceptron (MLP) network using ($i$)~a synthetic low-spatial-resolution multispectral image (LR-MSI) obtained by applying the SRF to the LR-HSI as input, ($ii$)~the LR-HSI as the output, and ($iii$)~an $\ell_1$ spectral reconstruction loss between the estimated and true LR-HSI as the optimization objective. At inference, SpectraLift uses the trained network to map the HR-MSI pixel-wise into a HR-HSI estimate. SpectraLift converges in minutes, is agnostic to spatial blur and resolution, and outperforms state-of-the-art methods on PSNR, SAM, SSIM, and RMSE benchmarks.
♻ ☆ Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments
Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical conservation laws? To answer this, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Since artificial generations lack ground truth, we assess physical plausibility using physics-informed metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, leaderboard, and code are open-sourced at our project page.
♻ ☆ DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection NeurIPS 2025
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluated on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
comment: 30 pages, 12 figures, 4 tables
♻ ☆ AI-Generated Video Detection via Perceptual Straightening NeurIPS 2025
The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV(Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests real-world video trajectories become more straight in neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, it is offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
comment: Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
♻ ☆ Principled Feature Disentanglement for High-Fidelity Unified Brain MRI Synthesis
Multisequence Magnetic Resonance Imaging (MRI) provides a more reliable diagnosis in clinical applications through complementary information across sequences. However, in practice, the absence of certain MR sequences is a common problem that can lead to inconsistent analysis results. In this work, we propose a novel unified framework for synthesizing multisequence MR images, called hybrid-fusion GAN (HF-GAN). The fundamental mechanism of this work is principled feature disentanglement, which aligns the design of the architecture with the complexity of the features. A powerful many-to-one stream is constructed for the extraction of complex complementary features, while utilizing parallel, one-to-one streams to process modality-specific information. These disentangled features are dynamically integrated into a common latent space by a channel attention-based fusion module (CAFF) and then transformed via a modality infuser to generate the target sequence. We validated our framework on public datasets of both healthy and pathological brain MRI. Quantitative and qualitative results show that HF-GAN achieves state-of-the-art performance, with our 2D slice-based framework notably outperforming a leading 3D volumetric model. Furthermore, the utilization of HF-GAN for data imputation substantially improves the performance of the downstream brain tumor segmentation task, demonstrating its clinical relevance.
comment: 14 pages, 9 figures
♻ ☆ Grounded Reinforcement Learning for Visual Reasoning
While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
comment: Project website: https://visually-grounded-rl.github.io/
♻ ☆ VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models NeurIPS 2025
This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input. All the resources are available at https://walkermitty.github.io/VimoRAG/
comment: Accepted by NeurIPS 2025; Project Page: https://walkermitty.github.io/VimoRAG
♻ ☆ When Words Smile: Generating Diverse Emotional Facial Expressions from Text
Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text-3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method significantly outperforms baselines across multiple evaluation metrics, marking a significant advancement in the field.
comment: Accepted by EMNLP 2025 (Oral); Project Page: https://walkermitty.github.io/EmoAva
♻ ☆ WaveFormer: A Lightweight Transformer Model for sEMG-based Gesture Recognition
Human-machine interaction, particularly in prosthetic and robotic control, has seen progress with gesture recognition via surface electromyographic (sEMG) signals.However, classifying similar gestures that produce nearly identical muscle signals remains a challenge, often reducing classification accuracy. Traditional deep learning models for sEMG gesture recognition are large and computationally expensive, limiting their deployment on resource-constrained embedded systems. In this work, we propose WaveFormer, a lightweight transformer-based architecture tailored for sEMG gesture recognition. Our model integrates time-domain and frequency-domain features through a novel learnable wavelet transform, enhancing feature extraction. In particular, the WaveletConv module, a multi-level wavelet decomposition layer with depthwise separable convolution, ensures both efficiency and compactness. With just 3.1 million parameters, WaveFormer achieves 95% classification accuracy on the EPN612 dataset, outperforming larger models. Furthermore, when profiled on a laptop equipped with an Intel CPU, INT8 quantization achieves real-time deployment with a 6.75 ms inference latency.
comment: 6 pages, 3 figures, accepted to IEEE EMBS Conference on Neural Engineering (NER) 2025. Code and data are available at https://github.com/ForeverBlue816/WaveFormer
♻ ☆ ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
comment: ScreenCoder-v2
♻ ☆ Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles NeurIPS 2025
Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieve a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.
comment: NeurIPS 2025, Project page: https://nickisdope.github.io/Styl3R
♻ ☆ Predicting Patient Recovery or Mortality Using Deep Neural Decision Tree and Forest
Objective: Identifying patients at high risk of mortality is crucial for emergency physicians to allocate hospital resources effectively, particularly in regions with limited medical services. This need becomes even more pressing during global health crises that lead to significant morbidity and mortality. This study aimed to present the usability deep neural decision forest and deep neural decision tree to predict mortality among Coronavirus disease 2019 (COVID-19) patients. To this end, We used patient data encompassing Coronavirus disease 2019 diagnosis, demographics, health indicators, and occupational risk factors to analyze disease severity and outcomes. The dataset was partitioned using a stratified sampling method, ensuring that 80% was allocated for training and 20% for testing. Nine machine learning and deep learning methods were employed to build predictive models. The models were evaluated across all stages to determine their effectiveness in predicting patient outcomes. Results: Among the models, the deep neural decision forest consistently outperformed others. Results indicated that using only clinical data yielded an accuracy of 80% by deep neural decision forest, demonstrating it as a reliable predictor of patient mortality. Moreover, the results suggest that clinical data alone may be the most accurate diagnostic tool for predicting mortality.
♻ ☆ NeRF-based Visualization of 3D Cues Supporting Data-Driven Spacecraft Pose Estimation
On-orbit operations require the estimation of the relative 6D pose, i.e., position and orientation, between a chaser spacecraft and its target. While data-driven spacecraft pose estimation methods have been developed, their adoption in real missions is hampered by the lack of understanding of their decision process. This paper presents a method to visualize the 3D visual cues on which a given pose estimator relies. For this purpose, we train a NeRF-based image generator using the gradients back-propagated through the pose estimation network. This enforces the generator to render the main 3D features exploited by the spacecraft pose estimation network. Experiments demonstrate that our method recovers the relevant 3D cues. Furthermore, they offer additional insights on the relationship between the pose estimation network supervision and its implicit representation of the target spacecraft.
comment: Accepted at IEEE ISpaRo 2025 (International Conference on Space Robotics) (8 pages, 2 figures)
♻ ☆ FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions
While Diffusion Models (DM) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias presented in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods on debiasing DMs usually requires model retraining with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets causes expensive annotation cost; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose FairGen, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference dataset. Specifically, FairGen consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process. Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for retraining. Extensive experiments on debiasing gender, racial, and their intersectional biases show that our method outperforms previous SOTA by a large margin.
♻ ☆ Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings:(1) In the visual understanding task, Nexus exhibits superior performance compared with its backbone model - Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) Within the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e, MiniCPM-o2.6-7B) in the LLaMA Q. benchmark. (3) In our real-world ASR testset, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
comment: Project: https://github.com/HiThink-Research/NEXUS-O
♻ ☆ What is Memory? A Homological Perspective
We introduce the delta-homology model of memory, a unified framework in which recall, learning, and prediction emerge from cycle closure, the completion of topologically constrained trajectories within the brain's latent manifold. A Dirac-like memory trace corresponds to a nontrivial homology generator, representing a sparse, irreducible attractor that reactivates only when inference trajectories close upon themselves. In this view, memory is not a static attractor landscape but a topological process of recurrence, where structure arises through the stabilization of closed loops. Building on this principle, we represent spike-timing dynamics as spatiotemporal complexes, in which temporally consistent transitions among neurons form chain complexes supporting persistent activation cycles. These cycles are organized into cell posets, compact causal representations that encode overlapping and compositional memory traces. Within this construction, learning and recall correspond to cycle closure under contextual modulation: inference trajectories stabilize into nontrivial homology classes when both local synchrony (context) and global recurrence (content) are satisfied. We formalize this mechanism through the Context-Content Uncertainty Principle (CCUP), which states that cognition minimizes joint uncertainty between a high-entropy context variable and a low-entropy content variable. Synchronization acts as a context filter selecting coherent subnetworks, while recurrence acts as a content filter validating nontrivial cycles.
♻ ☆ SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology CVPR 2025
With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce corresponding SSL4Eco, a multi-date Sentinel-2 dataset, on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state of the art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.
comment: CVPR 2025, EarthVision workshop
♻ ☆ Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
comment: Withdrawn due to an accidental duplicate submission. This paper (arXiv:2510.15430) was unintentionally submitted as a new entry instead of a new version of our previous work (arXiv:2508.09201)
♻ ☆ Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
comment: 16 pages, 12 figures
♻ ☆ Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking
Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking leverages their strengths while mitigating these drawbacks. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing segmentation masks and tracking. Once initialized, this model directly extracts segmentation masks, processing image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments. Code, data, videos, and benchmarks are available at https://vlm-gist.github.io
comment: IEEE Robotics and Automation Letters (RA-L), November 2025
♻ ☆ Shaken or Stirred? An Analysis of MetaFormer's Token Mixing for Medical Imaging
The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable designs choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans eight datasets covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g. grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel-MLPs already provide the necessary cross-channel interactions.
comment: Code and data: https://github.com/multimodallearning/MetaFormerMedImaging/tree/clean_code
♻ ☆ Accurate and Efficient Low-Rank Model Merging in Core Space NeurIPS 2025
In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.
comment: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, USA
♻ ☆ Periodontal Bone Loss Analysis via Keypoint Detection With Heuristic Post-Processing
This study proposes a deep learning framework and annotation methodology for the automatic detection of periodontal bone loss landmarks, associated conditions, and staging. 192 periapical radiographs were collected and annotated with a stage agnostic methodology, labelling clinically relevant landmarks regardless of disease presence or extent. We propose a heuristic post-processing module that aligns predicted keypoints to tooth boundaries using an auxiliary instance segmentation model. An evaluation metric, Percentage of Relative Correct Keypoints (PRCK), is proposed to capture keypoint performance in dental imaging domains. Four donor pose estimation models were adapted with fine-tuning for our keypoint problem. Post-processing improved fine-grained localisation, raising average PRCK^{0.05} by +0.028, but reduced coarse performance for PRCK^{0.25} by -0.0523 and PRCK^{0.5} by -0.0345. Orientation estimation shows excellent performance for auxiliary segmentation when filtered with either stage 1 object detection model. Periodontal staging was detected sufficiently, with the best mesial and distal Dice scores of 0.508 and 0.489, while furcation involvement and widened periodontal ligament space tasks remained challenging due to scarce positive samples. Scalability is implied with similar validation and external set performance. The annotation methodology enables stage agnostic training with balanced representation across disease severities for some detection tasks. The PRCK metric provides a domain-specific alternative to generic pose metrics, while the heuristic post-processing module consistently corrected implausible predictions with occasional catastrophic failures. The proposed framework demonstrates the feasibility of clinically interpretable periodontal bone loss assessment, with potential to reduce diagnostic variability and clinician workload.
comment: 18 pages, 7 tables, 9 figures, 1 equation, journal paper submitted to Computers in Biology and Medicine
♻ ☆ Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space
Given a video with $T$ frames, frame sampling is a task to select $N \ll T$ frames, so as to maximize the performance of a fixed video classifier. Not just brute-force search, but most existing methods suffer from its vast search space of $\binom{T}{N}$, especially when $N$ gets large. To address this challenge, we introduce a novel perspective of reducing the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
♻ ☆ Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems
♻ ☆ CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning
Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing. Orthographic projection reasoning underpins the entire CAD workflow, encompassing design, manufacturing, and simulation. However, prevailing deep-learning approaches employ standard 3D reconstruction pipelines as an alternative, which often introduce imprecise dimensions and limit the parametric editability required for CAD workflows. Recently, some researchers adopt vision-language models (VLMs), particularly supervised fine-tuning (SFT), to tackle CAD-related challenges. SFT shows promise but often devolves into pattern memorization, yielding poor out-of-distribution performance on complex reasoning tasks. To address these gaps, we introduce CReFT-CAD, a two-stage fine-tuning paradigm that first employs a curriculum-driven reinforcement learning stage with difficulty-aware rewards to build reasoning ability steadily, and then applies supervised post-tuning to hone instruction following and semantic extraction. Complementing this, we release TriView2CAD, the first large-scale, open-source benchmark for orthographic projection reasoning, comprising 200,000 synthetic and 3,000 real-world orthographic projections with precise dimension annotations and six interoperable data modalities. We benchmark leading VLMs on orthographic projection reasoning and demonstrate that CReFT-CAD substantially improves reasoning accuracy and out-of-distribution generalizability in real-world scenarios, offering valuable insights for advancing CAD reasoning research.
♻ ☆ DynVFX: Augmenting Real Videos with Dynamic Content
We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
comment: Project page: https://dynvfx.github.io
♻ ☆ MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation
Scarcity of annotated data, particularly for rare or atypical morphologies, present significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasests with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on columnar cells. This strategy systematically enriches data sets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models. Thereby, we pave the way for broader application of generative models in computational pathology.
♻ ☆ PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization
Accurate 3D shape representation is essential in engineering applications such as design, optimization, and simulation. In practice, engineering workflows require structured, part-based representations, as objects are inherently designed as assemblies of distinct components. However, most existing methods either model shapes holistically or decompose them without predefined part structures, limiting their applicability in real-world design tasks. We propose PartSDF, a supervised implicit representation framework that explicitly models composite shapes with independent, controllable parts while maintaining shape consistency. Thanks to its simple but innovative architecture, PartSDF outperforms both supervised and unsupervised baselines in reconstruction and generation tasks. We further demonstrate its effectiveness as a structured shape prior for engineering applications, enabling precise control over individual components while preserving overall coherence. Code available at https://github.com/cvlab-epfl/PartSDF.
comment: Accepted to TMLR (33 pages, 22 figures)
♻ ☆ DARIL: When Imitation Learning outperforms Reinforcement Learning in Surgical Action Planning MICCAI2025
Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could potentially discover superior strategies through self-exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next frame prediction mAP with smooth planning degradation to 29.2% at 10-second horizons. We evaluated three RL variants: world model-based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL--world model RL dropped to 3.1% mAP at 10s while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision making and provides crucial insights for surgical AI development.
comment: Paper accepted at the MICCAI2025 workshop proceedings on COLlaborative Intelligence and Autonomy in Image-guided Surgery (COLAS)
♻ ☆ Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR
Text image is a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (\emph{e.g.}, Flux-series) and unified generative models (\emph{e.g.}, GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess current state-of-the-art generative models' capabilities in terms of text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and categorize them into five categories: document, handwritten text, scene text, artistic text, and complex \& layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills into general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.
♻ ☆ Vision Mamba for Permeability Prediction of Porous Media
Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus, they can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.
♻ ☆ Learning Generalizable Shape Completion with SIM(3) Equivariance NeurIPS 2025
3D shape completion methods typically assume scans are pre-aligned to a canonical frame. This leaks pose and scale cues that networks may exploit to memorize absolute positions rather than inferring intrinsic geometry. When such alignment is absent in real data, performance collapses. We argue that robust generalization demands architectural equivariance to the similarity group, SIM(3), so the model remains agnostic to pose and scale. Following this principle, we introduce the first SIM(3)-equivariant shape completion network, whose modular layers successively canonicalize features, reason over similarity-invariant geometry, and restore the original frame. Under a de-biased evaluation protocol that removes the hidden cues, our model outperforms both equivariant and augmentation baselines on the PCN benchmark. It also sets new cross-domain records on real driving and indoor scans, lowering minimal matching distance on KITTI by 17% and Chamfer distance $\ell1$ on OmniObject3D by 14%. Perhaps surprisingly, ours under the stricter protocol still outperforms competitors under their biased settings. These results establish full SIM(3) equivariance as an effective route to truly generalizable shape completion. Project page: https://sime-completion.github.io.
comment: NeurIPS 2025
♻ ☆ Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization
We introduce Auto-Connect, a novel approach for automatic rigging that explicitly preserves skeletal connectivity through a connectivity-preserving tokenization scheme. Unlike previous methods that predict bone positions represented as two joints or first predict points before determining connectivity, our method employs special tokens to define endpoints for each joint's children and for each hierarchical layer, effectively automating connectivity relationships. This approach significantly enhances topological accuracy by integrating connectivity information directly into the prediction framework. To further guarantee high-quality topology, we implement a topology-aware reward function that quantifies topological correctness, which is then utilized in a post-training phase through reward-guided Direct Preference Optimization. Additionally, we incorporate implicit geodesic features for latent top-k bone selection, which substantially improves skinning quality. By leveraging geodesic distance information within the model's latent space, our approach intelligently determines the most influential bones for each vertex, effectively mitigating common skinning artifacts. This combination of connectivity-preserving tokenization, reward-guided fine-tuning, and geodesic-aware bone selection enables our model to consistently generate more anatomically plausible skeletal structures with superior deformation properties.
♻ ☆ BokehDiff: Neural Lens Blur with One-Step Diffusion ICCV 2025
We introduce BokehDiff, a novel lens blur rendering method that achieves physically accurate and visually appealing outcomes, with the help of generative diffusion prior. Previous methods are bounded by the accuracy of depth estimation, generating artifacts in depth discontinuities. Our method employs a physics-inspired self-attention module that aligns with the image formation process, incorporating depth-dependent circle of confusion constraint and self-occlusion effects. We adapt the diffusion model to the one-step inference scheme without introducing additional noise, and achieve results of high quality and fidelity. To address the lack of scalable paired data, we propose to synthesize photorealistic foregrounds with transparency with diffusion models, balancing authenticity and scene diversity.
comment: Accepted by ICCV 2025
♻ ☆ Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning
Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. On the other hand, Multi-modal Large Language Models (MLLMs) still lag behind, hindered by their outdated internal LLMs. Upgrading these is often prohibitively expensive, as it requires complete vision-language alignment retraining which is costly. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role to convert multi-modal inputs into detailed textual outputs that can be processed by any powerful, external, text-only LLM reasoners. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM based on the correctness of answers generated by the external reasoner to produce faithful and query-relevant captions. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: Once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining.
♻ ☆ Hyperspectral Anomaly Detection Fused Unified Nonconvex Tensor Ring Factors Regularization
In recent years, tensor decomposition-based approaches for hyperspectral anomaly detection (HAD) have gained significant attention in the field of remote sensing. However, existing methods often fail to fully leverage both the global correlations and local smoothness of the background components in hyperspectral images (HSIs), which exist in both the spectral and spatial domains. This limitation results in suboptimal detection performance. To mitigate this critical issue, we put forward a novel HAD method named HAD-EUNTRFR, which incorporates an enhanced unified nonconvex tensor ring (TR) factors regularization. In the HAD-EUNTRFR framework, the raw HSIs are first decomposed into background and anomaly components. The TR decomposition is then employed to capture the spatial-spectral correlations within the background component. Additionally, we introduce a unified and efficient nonconvex regularizer, induced by tensor singular value decomposition (TSVD), to simultaneously encode the low-rankness and sparsity of the 3-D gradient TR factors into a unique concise form. The above characterization scheme enables the interpretable gradient TR factors to inherit the low-rankness and smoothness of the original background. To further enhance anomaly detection, we design a generalized nonconvex regularization term to exploit the group sparsity of the anomaly component. To solve the resulting doubly nonconvex model, we develop a highly efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) framework. Experimental results on several benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art (SOTA) approaches in terms of detection accuracy.
♻ ☆ Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering NeurIPS 2025
Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via a reinforcement learning manner. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements~(36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
comment: Accepted by NeurIPS 2025
♻ ☆ Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
♻ ☆ DeepSeek-Inspired Exploration of RL-based LLMs and Synergy with Wireless Networks: A Survey
Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have attracted widespread attention for their remarkable capabilities in multimodal data understanding. Meanwhile, the rapid expansion of information services has led to a growing demand for AI-enabled wireless networks. The open-source DeepSeek models are famous for their innovative designs, such as large-scale pure RL and cost-efficient training, which make them well-suited for practical deployment in wireless networks. By integrating DeepSeek-style LLMs with wireless infrastructures, a synergistic opportunity arises: the DeepSeek-style LLMs enhance network optimization with strong reasoning and decision-making abilities, while wireless infrastructure enables the broad deployment of these models. Motivated by this convergence, this survey presents a comprehensive DeepSeek-inspired exploration of RL-based LLMs in the context of wireless networks. We begin by reviewing key techniques behind network optimization to establish a foundation for understanding DeepSeek-style LLM integration. Next, we examine recent advancements in RL-based LLMs, using DeepSeek models as a representative example. Building on this, we explore the synergy between the two domains, highlighting motivations, challenges, and potential solutions. Finally, we highlight emerging directions for integrating LLMs with wireless networks, such as quantum, on-device, and neural-symbolic LLM models, as well as embodied AI agents. Overall, this survey offers a comprehensive examination of the interplay between DeepSeek-style LLMs and wireless networks, demonstrating how these domains can mutually enhance each other to drive innovation.
comment: 45 pages, 12 figures
♻ ☆ Dress Well via Fashion Cognitive Learning
Fashion compatibility models enable online retailers to easily obtain a large number of outfit compositions with good quality. However, effective fashion recommendation demands precise service for each customer with a deeper cognition of fashion. In this paper, we conduct the first study on fashion cognitive learning, which is fashion recommendations conditioned on personal physical information. To this end, we propose a Fashion Cognitive Network (FCN) to learn the relationships among visual-semantic embedding of outfit composition and appearance features of individuals. FCN contains two submodules, namely outfit encoder and Multi-label Graph Neural Network (ML-GCN). The outfit encoder uses a convolutional layer to encode an outfit into an outfit embedding. The latter module learns label classifiers via stacked GCN. We conducted extensive experiments on the newly collected O4U dataset, and the results provide strong qualitative and quantitative evidence that our framework outperforms alternative methods.
♻ ☆ MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which is further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting the cancer diagnosis.
comment: 18 pages, 7 figures, 10 tables. Code available at https://github.com/TianyiFranklinWang/MIRROR. Project page: https://tianyifranklinwang.github.io/MIRROR
♻ ☆ NFIG: Multi-Scale Autoregressive Image Generation via Frequency Ordering
Autoregressive models have achieved significant success in image generation. However, unlike the inherent hierarchical structure of image information in the spectral domain, standard autoregressive methods typically generate pixels sequentially in a fixed spatial order. To better leverage this spectral hierarchy, we introduce NextFrequency Image Generation (NFIG). NFIG is a novel framework that decomposes the image generation process into multiple frequency-guided stages. NFIG aligns the generation process with the natural image structure. It does this by first generating low-frequency components, which efficiently capture global structure with significantly fewer tokens, and then progressively adding higher-frequency details. This frequency-aware paradigm offers substantial advantages: it not only improves the quality of generated images but crucially reduces inference cost by efficiently establishing global structure early on. Extensive experiments on the ImageNet-256 benchmark validate NFIG's effectiveness, demonstrating superior performance (FID: 2.81) and a notable 1.25x speedup compared to the strong baseline VAR-d20. The source code is available at https://github.com/Pride-Huang/NFIG.
comment: 10 pages, 7 figures, 2 tables
♻ ☆ FlowDet: Overcoming Perspective and Scale Challenges in Real-Time End-to-End Traffic Detection
End-to-end object detectors offer a promising NMS-free paradigm for real-time applications, yet their high computational cost remains a significant barrier, particularly for complex scenarios like intersection traffic monitoring. To address this challenge, we propose FlowDet, a high-speed detector featuring a decoupled encoder optimization strategy applied to the DETR architecture. Specifically, FlowDet employs a novel Geometric Deformable Unit (GDU) for traffic-aware geometric modeling and a Scale-Aware Attention (SAA) module to maintain high representational power across extreme scale variations. To rigorously evaluate the model's performance in environments with severe occlusion and high object density, we collected the Intersection-Flow-5k dataset, a new challenging scene for this task. Evaluated on Intersection-Flow-5k, FlowDet establishes a new state-of-the-art. Compared to the strong RT-DETR baseline, it improves AP(test) by 1.5% and AP50(test) by 1.6%, while simultaneously reducing GFLOPs by 63.2% and increasing inference speed by 16.2%. Our work demonstrates a new path towards building highly efficient and accurate detectors for demanding, real-world perception systems. The Intersection-Flow-5k dataset is available at https://github.com/AstronZh/Intersection-Flow-5K.
comment: Accepted by PRCV 2025. Project page with code and dataset: https://github.com/AstronZh/Intersection-Flow-5K
♻ ☆ Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection
Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adaptation to unseen complex wide-area scenes. To address the challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understandings of unseen scenes at any scale using minimal interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks.
comment: Accepted by ACMMM2025
♻ ☆ REACT-KD: Region-Aware Cross-modal Topological Knowledge Distillation for Interpretable Medical Image Classification
Reliable and interpretable tumor classification from clinical imaging remains a core challenge. The main difficulties arise from heterogeneous modality quality, limited annotations, and the absence of structured anatomical guidance. We present REACT-KD, a Region-Aware Cross-modal Topological Knowledge Distillation framework that transfers supervision from high-fidelity multi-modal sources into a lightweight CT-based student model. The framework employs a dual teacher design. One branch captures structure-function relationships through dual-tracer PET/CT, while the other models dose-aware features using synthetically degraded low-dose CT. These branches jointly guide the student model through two complementary objectives. The first achieves semantic alignment through logits distillation, and the second models anatomical topology through region graph distillation. A shared CBAM3D module ensures consistent attention across modalities. To improve reliability in deployment, REACT-KD introduces modality dropout during training, which enables robust inference under partial or noisy inputs. As a case study, we applied REACT-KD to hepatocellular carcinoma staging. The framework achieved an average AUC of 93.5\% on an internal PET/CT cohort and maintained 76.6\% to 81.5\% AUC across varying levels of dose degradation in external CT testing. Decision curve analysis further shows that REACT-KD consistently provides the highest net clinical benefit across all thresholds, confirming its value in real-world diagnostic practice. Code is available at: https://github.com/Kinetics-JOJO/REACT-KD
♻ ☆ Patch of Invisibility: Naturalistic Physical Black-Box Adversarial Attacks on Object Detectors ECML
Adversarial attacks on deep learning models have received increased attention in recent years. Work in this area has mostly focused on gradient-based techniques, so-called 'white-box' attacks, where the attacker has access to the targeted model's internal parameters; such an assumption is usually untenable in the real world. Additionally, some attacks use the entire pixel space to fool a given model, which is neither practical nor physical. To accommodate these problems we propose the BBNP algorithm (Black-Box Naturalistic Patch): a direct, black-box, naturalistic, gradient-free method that uses the learned image manifold of a pretrained, generative adversarial network (GAN) to generate naturalistic adversarial patches for object detectors. This method performs model-agnostic black-box naturalistic attacks on object detection models by relying solely on the outputs of the model. Comparing our approach against five models, five black-box and two white-box attacks, we show that our proposed method achieves state-of-the-art results, outperforming all other tested black-box approaches.
comment: Accepted at MLCS @ ECML-PKDD 2024
♻ ☆ Large Language Model-Guided Semantic Alignment for Human Activity Recognition
Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is critical for applications in healthcare, safety, and industrial production. However, variations in activity patterns, device types, and sensor placements create distribution gaps across datasets, reducing the performance of HAR models. To address this, we propose LanHAR, a novel system that leverages Large Language Models (LLMs) to generate semantic interpretations of sensor readings and activity labels for cross-dataset HAR. This approach not only mitigates cross-dataset heterogeneity but also enhances the recognition of new activities. LanHAR employs an iterative re-generation method to produce high-quality semantic interpretations with LLMs and a two-stage training framework that bridges the semantic interpretations of sensor readings and activity labels. This ultimately leads to a lightweight sensor encoder suitable for mobile deployment, enabling any sensor reading to be mapped into the semantic interpretation space. Experiments on five public datasets demonstrate that our approach significantly outperforms state-of-the-art methods in both cross-dataset HAR and new activity recognition. The source code is publicly available at https://github.com/DASHLab/LanHAR.
♻ ☆ From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging NeurIPS 2025
Face aging has become a crucial task in computer vision, with applications ranging from entertainment to healthcare. However, existing methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. The core challenge lies in balancing age accuracy and identity preservation--what we refer to as the Age-ID trade-off. Most prior methods either prioritize age transformation at the expense of identity consistency or vice versa. In this work, we address this issue by proposing a two-pass face aging framework, named Cradle2Cane, based on few-step text-to-image (T2I) diffusion models. The first pass focuses on solving age accuracy by introducing an adaptive noise injection (AdaNI) mechanism. This mechanism is guided by including prompt descriptions of age and gender for the given person as the textual condition. Also, by adjusting the noise level, we can control the strength of aging while allowing more flexibility in transforming the face. However, identity preservation is weakly ensured here to facilitate stronger age transformations. In the second pass, we enhance identity preservation while maintaining age-specific features by conditioning the model on two identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP. This pass allows for denoising the transformed image from the first pass, ensuring stronger identity preservation without compromising the aging accuracy. Both passes are jointly trained in an end-to-end way. Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that our Cradle2Cane outperforms existing face aging methods in age accuracy and identity consistency. Code is available at https://github.com/byliutao/Cradle2Cane.
comment: 32 pages, 12 figures, NeurIPS 2025 Poster
♻ ☆ The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA ICCV 2025
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $\mathcal{J\&F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in Sa2VA repository: https://github.com/bytedance/Sa2VA.
comment: The 1st place report of 7th LSVOS challenge RVOS track in ICCV 2025. The code is released in Sa2VA repository: https://github.com/bytedance/Sa2VA
♻ ☆ Adaptive Convolutional Neural Network for Image Super-resolution
Convolutional neural networks can automatically learn features via deep network architectures and given input samples. However, the robustness of obtained models may face challenges in varying scenes. Bigger differences in network architecture are beneficial to extract more diversified structural information to strengthen the robustness of an obtained super-resolution model. In this paper, we proposed a adaptive convolutional neural network for image super-resolution (ADSRNet). To capture more information, ADSRNet is implemented by a heterogeneous parallel network. The upper network can enhance relation of context information, salient information relation of a kernel mapping and relations of shallow and deep layers to improve performance of image super-resolution. That can strengthen adaptability of an obtained super-resolution model for different scenes. The lower network utilizes a symmetric architecture to enhance relations of different layers to mine more structural information, which is complementary with a upper network for image super-resolution. The relevant experimental results show that the proposed ADSRNet is effective to deal with image resolving. Codes are obtained at https://github.com/hellloxiaotian/ADSRNet.
comment: 11pages, 7 figures
♻ ☆ OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates
Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates. termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models are available at https://github.com/jp-guo/OSCAR.
♻ ☆ MotionGPT3: Human Motion as a Second Modality
With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizing optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
comment: 26 pages, 11 figures
♻ ☆ SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization
Text-to-motion generation is essential for advancing the creative industry but often presents challenges in producing consistent, realistic motions. To address this, we focus on fine-tuning text-to-motion models to consistently favor high-quality, human-preferred motions, a critical yet largely unexplored problem. In this work, we theoretically investigate the DPO under both online and offline settings, and reveal their respective limitation: overfitting in offline DPO, and biased sampling in online DPO. Building on our theoretical insights, we introduce Semi-online Preference Optimization (SoPo), a DPO-based method for training text-to-motion models using "semi-online" data pair, consisting of unpreferred motion from online distribution and preferred motion in offline datasets. This method leverages both online and offline DPO, allowing each to compensate for the other's limitations. Extensive experiments demonstrate that SoPo outperforms other preference alignment methods, with an MM-Dist of 3.25% (vs e.g. 0.76% of MoDiPO) on the MLD model, 2.91% (vs e.g. 0.66% of MoDiPO) on MDM model, respectively. Additionally, the MLD model fine-tuned by our SoPo surpasses the SoTA model in terms of R-precision and MM Dist. Visualization results also show the efficacy of our SoPo in preference alignment. Project page: https://xiaofeng-tan.github.io/projects/SoPo/ .
♻ ☆ Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion
Multimodal learning (MML) is significantly constrained by modality imbalance, leading to suboptimal performance in practice. While existing approaches primarily focus on balancing the learning of different modalities to address this issue, they fundamentally overlook the inherent disproportion in model classification ability, which serves as the primary cause of this phenomenon. In this paper, we propose a novel multimodal learning approach to dynamically balance the classification ability of weak and strong modalities by incorporating the principle of boosting. Concretely, we first propose a sustained boosting algorithm in multimodal learning by simultaneously optimizing the classification and residual errors. Subsequently, we introduce an adaptive classifier assignment strategy to dynamically facilitate the classification performance of the weak modality. Furthermore, we theoretically analyze the convergence property of the cross-modal gap function, ensuring the effectiveness of the proposed boosting scheme. To this end, the classification ability of strong and weak modalities is expected to be balanced, thereby mitigating the imbalance issue. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SOTA) multimodal learning baselines. The source code is available at https://github.com/njustkmg/NeurIPS25-AUG.
♻ ☆ Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation
Cross-Domain Sequential Recommendation (CDSR) plays a crucial role in modern consumer electronics and e-commerce platforms, where users interact with diverse services such as books, movies, and online retail products. These systems must accurately capture both domain-specific and cross-domain behavioral patterns to provide personalized and seamless consumer experiences. To address this challenge, we propose \textbf{TEMA-LLM} (\textit{Tag-Enriched Multi-Attention with Large Language Models}), a practical and effective framework that integrates \textit{Large Language Models (LLMs)} for semantic tag generation and enrichment. Specifically, TEMA-LLM employs LLMs to assign domain-aware prompts and generate descriptive tags from item titles and descriptions. The resulting tag embeddings are fused with item identifiers as well as textual and visual features to construct enhanced item representations. A \textit{Tag-Enriched Multi-Attention} mechanism is then introduced to jointly model user preferences within and across domains, enabling the system to capture complex and evolving consumer interests. Extensive experiments on four large-scale e-commerce datasets demonstrate that TEMA-LLM consistently outperforms state-of-the-art baselines, underscoring the benefits of LLM-based semantic tagging and multi-attention integration for consumer-facing recommendation systems. The proposed approach highlights the potential of LLMs to advance intelligent, user-centric services in the field of consumer electronics.
comment: Accepted in IEEE Transactions on Consumer Electronics 2025
♻ ☆ VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles
comment: 13 pages, 4 figures
♻ ☆ Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs NeurIPS 2025
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
comment: Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page: https://liuxuannan.github.io/Video-SafetyBench.github.io/
♻ ☆ Latent Diffusion Model without Variational Autoencoder
Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.
♻ ☆ Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!
Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose \textbf{stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL)}, a new task that enables users to modify generated videos \emph{anytime} on \emph{anything} via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames with both supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: \emph{i}) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; \emph{ii}) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, \textbf{DragStream}, comprising: \emph{i}) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; \emph{ii}) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference via selectively propagating visual cues along generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of our DragStream.
♻ ☆ CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records ICCV 2025
Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research on this unexplored field. Our dataset and code are available at https://github.com/BCV-Uniandes/Cardium, and at the project website https://bcv-uniandes.github.io/CardiumPage/
comment: Accepted to CVAMD Workshop, ICCV 2025
♻ ☆ Geodesic Diffusion Models for Efficient Medical Image Enhancement
Diffusion models generate data by learning to reverse a forward process, where samples are progressively perturbed with Gaussian noise according to a predefined noise schedule. From a geometric perspective, each noise schedule corresponds to a unique trajectory in probability space from the data distribution to a Gaussian prior. However, prior diffusion models rely on empirically chosen schedules that may not be optimal. This inefficiency necessitates many intermediate time steps, resulting in high computational costs during both training and sampling. To address this, we derive a family of geodesic noise schedules corresponding to the shortest paths in probability space under the Fisher-Rao metric. Based on these schedules, we propose Geodesic Diffusion Models (GDMs), which significantly improve training and sampling efficiency by minimizing the energy required to transform between probability distributions. This efficiency further enables sampling to start from an intermediate distribution in conditional image generation, achieving state-of-the-art results with as few as 6 steps. We evaluated GDM on two medical image enhancement tasks: CT image denoising and MRI image super-resolution. Experimental results show that GDM achieved state-of-the-art performance while reducing training time by 20- to 30-fold compared to Denoising Diffusion Probabilistic Models (DDPMs) and 4- to 6-fold compared to Fast-DDPM, and accelerating sampling by 160- to 170-fold and 1.6-fold, respectively. These gains support the use of GDM for efficient model development and real-time clinical applications. Our code is publicly available at: https://github.com/mirthAI/GDM-VE.
♻ ☆ Attention (as Discrete-Time Markov) Chains NeurIPS 2025
We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our key observation is that tokens linked to semantically similar regions form metastable states, i.e., regions where attention tends to concentrate, while noisy attention scores dissipate. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank -- the steady state vector of the Markov chain, which measures global token importance. We show that TokenRank enhances unconditional image generation, improving both quality (IS) and diversity (FID), and can also be incorporated into existing segmentation techniques to improve their performance over existing benchmarks. We believe our framework offers a fresh view of how tokens are being attended in modern visual transformers.
comment: NeurIPS 2025. Project page: https://yoterel.github.io/attention_chains/
♻ ☆ Indoor Heat Estimation from a Single Visible-Light Panorama
This paper introduces a novel image-based rendering technique for jointly estimating indoor lighting and thermal conditions from paired indoor-outdoor high dynamic range (HDR) panoramas. Our method uses the indoor panorama to estimate the 3D floor layout, while the corresponding outdoor panorama serves as an environment map to infer spatially-varying illumination and material properties. Assuming indoor surfaces are Lambertian and that all heat originates from outdoor visible light, we model the relationship between light transport and heat transfer, and perform transient heat simulations to generate indoor temperature distributions. The simulated heat maps are validated against real-world thermal images captured with an infrared camera. This approach supports photorealistic and physically informed visualization, enabling integrated light and heat estimation to advance traditional virtual home staging.
comment: ISVC'25, Selected for Oral Presentation
♻ ☆ HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization
We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. https://hoidini.github.io.
comment: Project page: https://hoidini.github.io
♻ ☆ COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification
We introduce CoLoRA (Convolutional Low-Rank Adaptation), a parameter-efficient fine-tuning method for convolutional neural networks (CNNs). CoLoRA extends LoRA to convolutional layers by decomposing kernel updates into lightweight depthwise and pointwise components.This design reduces the number of trainable parameters to 0.2 compared to conventional fine-tuning, preserves the original model size, and allows merging updates into the pretrained weights after each epoch, keeping inference complexity unchanged. On OCTMNISTv2, CoLoRA applied to VGG16 and ResNet50 achieves up to 1 percent accuracy and 0.013 AUC improvements over strong baselines (Vision Transformers, state-space, and Kolmogorov Arnold models) while reducing per-epoch training time by nearly 20 percent. Results indicate that CoLoRA provides a stable and effective alternative to full fine-tuning for medical image classification.
comment: 15 pages, 13 figures
♻ ☆ VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture
In modern smart agriculture, object detection plays a crucial role by enabling automation, precision farming, and monitoring of resources. From identifying crop health and pest infestations to optimizing harvesting processes, accurate object detection enhances both productivity and sustainability. However, training object detection models often requires large-scale data collection and raises privacy concerns, particularly when sensitive agricultural data is distributed across farms. To address these challenges, we propose VLLFL, a vision-language model-based lightweight federated learning framework (VLLFL). It harnesses the generalization and context-aware detection capabilities of the vision-language model (VLM) and leverages the privacy-preserving nature of federated learning. By training a compact prompt generator to boost the performance of the VLM deployed across different farms, VLLFL preserves privacy while reducing communication overhead. Experimental results demonstrate that VLLFL achieves 14.53% improvement in the performance of VLM while reducing 99.3% communication overhead. Spanning tasks from identifying a wide variety of fruits to detecting harmful animals in agriculture, the proposed framework offers an efficient, scalable, and privacy-preserving solution specifically tailored to agricultural applications.
♻ ☆ The Impact of Coreset Selection on Spurious Correlations and Group Robustness
Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, as many datasets suffer from biases that cause models to learn spurious correlations instead of causal features, it is important to understand whether and how dataset reduction methods may perpetuate, amplify, or mitigate these biases. In this work, we conduct the first comprehensive analysis of the implications of data selection on the spurious bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlations benchmarks, five score metrics to characterize sample importance/ difficulty, and five data selection policies across a broad range of coreset sizes. Thereby, we unravel a series of nontrivial nuances in interactions between sample difficulty and bias alignment, as well as dataset bias and resultant model robustness. For example, we find that selecting coresets using embedding-based sample characterization scores runs a comparatively lower risk of inadvertently exacerbating bias than selecting using characterizations based on learning dynamics. Most importantly, our analysis reveals that although some coreset selection methods could achieve lower bias levels by prioritizing difficult samples, they do not reliably guarantee downstream robustness.
comment: 10 pages, 9 additional pages for Appendix
♻ ☆ Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation
Separating moving and static objects from a moving camera viewpoint is essential for 3D reconstruction, autonomous navigation, and scene understanding in robotics. Existing approaches often rely primarily on optical flow, which struggles to detect moving objects in complex, structured scenes involving camera motion. To address this limitation, we propose Focus of Expansion Likelihood and Segmentation (FoELS), a method based on the core idea of integrating both optical flow and texture information. FoELS computes the focus of expansion (FoE) from optical flow and derives an initial motion likelihood from the outliers of the FoE computation. This likelihood is then fused with a segmentation-based prior to estimate the final moving probability. The method effectively handles challenges including complex structured scenes, rotational camera motion, and parallel motion. Comprehensive evaluations on the DAVIS 2016 dataset and real-world traffic videos demonstrate its effectiveness and state-of-the-art performance.
comment: 8 pages, 15 figures, RA-L submission
♻ ☆ 3D Audio-Visual Segmentation NeurIPS 2024
Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models synergistically with 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://x-up-lab.github.io/research/3d-audio-visual-segmentation/
comment: Accepted at the NeurIPS 2024 Workshop on Audio Imagination; this version updates the project page link
♻ ☆ FlySearch: Exploring how vision-language models explore NeurIPS 2025
The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.
comment: NeurIPS 2025 Datasets and Benchmarks track
♻ ☆ TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.
comment: EMNLP2025 Main, 28 pages, 11 figures, 19 tables
♻ ☆ Monitoring morphometric drift in lifelong learning segmentation of the spinal cord
Morphometric measures derived from spinal cord segmentations can serve as diagnostic and prognostic biomarkers in neurological diseases and injuries affecting the spinal cord. While robust, automatic segmentation methods to a wide variety of contrasts and pathologies have been developed over the past few years, whether their predictions are stable as the model is updated using new datasets has not been assessed. This is particularly important for deriving normative values from healthy participants. In this study, we present a spinal cord segmentation model trained on a multisite $(n=75)$ dataset, including 9 different MRI contrasts and several spinal cord pathologies. We also introduce a lifelong learning framework to automatically monitor the morphometric drift as the model is updated using additional datasets. The framework is triggered by an automatic GitHub Actions workflow every time a new model is created, recording the morphometric values derived from the model's predictions over time. As a real-world application of the proposed framework, we employed the spinal cord segmentation model to update a recently-introduced normative database of healthy participants containing commonly used measures of spinal cord morphometry. Results showed that: (i) our model outperforms previous versions and pathology-specific models on challenging lumbar spinal cord cases, achieving an average Dice score of $0.95 \pm 0.03$; (ii) the automatic workflow for monitoring morphometric drift provides a quick feedback loop for developing future segmentation models; and (iii) the scaling factor required to update the database of morphometric measures is nearly constant among slices across the given vertebral levels, showing minimum drift between the current and previous versions of the model monitored by the framework. The code and model are open-source and accessible via Spinal Cord Toolbox v7.0.
comment: Under review (after 1st round of revision) at Imaging Neuroscience journal
♻ ☆ Predicting Video Slot Attention Queries from Random Slot-Feature Pairs
Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates current video frame into object features, termed slots, under some queries; A transitioner transits current slots to queries for the next frame. This is an effective architecture but all existing implementations both (\textit{i1}) neglect to incorporate next frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (\textit{t2}) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpass existing video OCL methods significantly, e.g., up to 10 points on object discovery, setting new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code and training logs are available on https://github.com/Genera1Z/RandSF.Q.
♻ ☆ REOrdering Patches Improves Vision Models NeurIPS 2025
Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
comment: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs
This study evaluates whether integrating curriculum learning with diffusion-based synthetic augmentation can enhance the detection of difficult pulmonary nodules in chest radiographs, particularly those with low size, brightness, and contrast, which often challenge conventional AI models due to data imbalance and limited annotation. A Faster R-CNN with a Feature Pyramid Network (FPN) backbone was trained on a hybrid dataset comprising expert-labeled NODE21 (1,213 patients; 52.4 percent male; mean age 63.2 +/- 11.5 years), VinDr-CXR, CheXpert, and 11,206 DDPM-generated synthetic images. Difficulty scores based on size, brightness, and contrast guided curriculum learning. Performance was compared to a non-curriculum baseline using mean average precision (mAP), Dice score, and area under the curve (AUC). Statistical tests included bootstrapped confidence intervals, DeLong tests, and paired t-tests. The curriculum model achieved a mean AUC of 0.95 versus 0.89 for the baseline (p < 0.001), with improvements in sensitivity (70 percent vs. 48 percent) and accuracy (82 percent vs. 70 percent). Stratified analysis demonstrated consistent gains across all difficulty bins (Easy to Very Hard). Grad-CAM visualizations confirmed more anatomically focused attention under curriculum learning. These results suggest that curriculum-guided synthetic augmentation enhances model robustness and generalization for pulmonary nodule detection.
comment: This version has been withdrawn due to authorship changes and a decision to substantially revise the manuscript with new methodology. A future version may be submitted separately
♻ ☆ Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. hile Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren't universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model's OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
♻ ☆ DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response
Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate a remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: 1) Multi-hazard: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. 2)Multi-sensor: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. 3) Multi-task: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks, harnessing the full potential of VLM's reasoning ability with progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements across all tasks, with robust cross-sensor and cross-disaster generalization capabilities. The code and data are available at: https://github.com/Junjue-Wang/DisasterM3.
comment: A multi-hazard, multi-sensor, and multi-task vision-language dataset for global-scale disaster assessment and response
♻ ☆ SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction NeurIPS 2025
Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as prior, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.
♻ ☆ Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.
comment: author comment: This version was previously removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. The authors have now obtained the necessary permissions, and the paper is resubmitted accordingly
♻ ☆ Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation
Robot learning of manipulation skills is hindered by the scarcity of diverse, unbiased datasets. While curated datasets can help, challenges remain in generalizability and real-world transfer. Meanwhile, large-scale "in-the-wild" video datasets have driven progress in computer vision through self-supervised techniques. Translating this to robotics, recent works have explored learning manipulation skills by passively watching abundant videos sourced online. Showing promising results, such video-based learning paradigms provide scalable supervision while reducing dataset bias. This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources, as well as emerging techniques for acquiring robot manipulation skills from uncontrolled video demonstrations. We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation. The survey summarizes video-based learning approaches, analyses their benefits over standard datasets, survey metrics, and benchmarks, and discusses open challenges and future directions in this nascent domain at the intersection of computer vision, natural language processing, and robot learning.
comment: Published at IEEE Access
♻ ☆ RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation
In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies, while transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model's ability to capture long-range dependencies and to improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed Global-Local Spatial Perception (GLSP) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that the RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.
♻ ☆ Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
comment: 16 pages; Previously this version appeared as arXiv:2510.15430 which was submitted as a new work by accident
Machine Learning 297
☆ Unbiased Gradient Low-Rank Projection
Memory-efficient optimization is critical for training increasingly large language models (LLMs). A popular strategy involves gradient low-rank projection, storing only the projected optimizer states, with GaLore being a representative example. However, a significant drawback of many such methods is their lack of convergence guarantees, as various low-rank projection approaches introduce inherent biases relative to the original optimization algorithms, which contribute to performance gaps compared to full-parameter training. Aiming to tackle this problem, this paper investigates the layerwise sampling technique for debiasing low-rank projection mechanisms. In particular, an instantiation of the paradigm gives rise to a novel and unbiased low-rank optimization method built upon GaLore's mechanism and the Muon algorithm, named GaLore Unbiased with Muon (GUM). We theoretically prove our method matches the convergence guarantees of the base Muon algorithm while preserving the memory efficiency of low-rank techniques. Empirical experiments on LLM fine-tuning and pretraining also demonstrate non-trivial improvements over GaLore and even better performance than full-parameter training. Further investigation shows that the improvement of this technique comes from a more uniform distribution of knowledge inside layers, leading to more efficient utilization of the model parameter space and better memorization.
☆ Glyph: Scaling Context Windows via Visual-Text Compression
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
☆ Executable Knowledge Graphs for Replicating AI Research
Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code will released at https://github.com/zjunlp/xKG.
comment: Work in progress
☆ Functional Distribution Networks (FDN) ICLR 2026
Modern probabilistic regressors often remain overconfident under distribution shift. We present Functional Distribution Networks (FDN), an input-conditioned distribution over network weights that induces predictive mixtures whose dispersion adapts to the input. FDN is trained with a beta-ELBO and Monte Carlo sampling. We further propose an evaluation protocol that cleanly separates interpolation from extrapolation and stresses OOD sanity checks (e.g., that predictive likelihood degrades under shift while in-distribution accuracy and calibration are maintained). On standard regression tasks, we benchmark against strong Bayesian, ensemble, dropout, and hypernetwork baselines under matched parameter and update budgets, and assess accuracy, calibration, and shift-awareness with standard diagnostics. Together, the framework and protocol aim to make OOD-aware, well-calibrated neural regression practical and modular.
comment: Submitted to ICLR 2026. Code will be released upon acceptance
☆ Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
comment: 29 pages, 9 tables, 6 figures
☆ SoftMimic: Learning Compliant Whole-body Control from Examples
We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning allows humanoids to quickly learn new skills, but existing methods incentivize stiff control that aggressively corrects deviations from a reference motion, leading to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach leverages an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which we use to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and generalize to varied tasks from a single motion clip. We validate our method through simulations and real-world experiments, demonstrating safe and effective interaction with the environment.
comment: Website: https://gmargo11.github.io/softmimic/
☆ Inference-Time Compute Scaling For Flow Matching
Allocating extra computation at inference time has recently improved sample quality in large language models and diffusion-based image generation. In parallel, Flow Matching (FM) has gained traction in language, vision, and scientific domains, but inference-time scaling methods for it remain under-explored. Concurrently, Kim et al., 2025 approach this problem but replace the linear interpolant with a non-linear variance-preserving (VP) interpolant at inference, sacrificing FM's efficient and straight sampling. Additionally, inference-time compute scaling for flow matching has only been applied to visual tasks, like image generation. We introduce novel inference-time scaling procedures for FM that preserve the linear interpolant during sampling. Evaluations of our method on image generation, and for the first time (to the best of our knowledge), unconditional protein generation, show that I) sample quality consistently improves as inference compute increases, and II) flow matching inference-time scaling can be applied to scientific domains.
☆ Mapping Post-Training Forgetting in Language Models at Scale
Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: Forgetting one fact (e.g., a U.S. president or an API call) does not "average out" by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1->0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0->1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) Domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and Instruction tuning yields moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) Applying RL/SFT to instruction-tuned models is sensitive on data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) Model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale -- enabling progress towards generally capable AI systems.
comment: 43 pages,15 figures
☆ Atlas-based Manifold Representations for Interpretable Riemannian Machine Learning
Despite the popularity of the manifold hypothesis, current manifold-learning methods do not support machine learning directly on the latent $d$-dimensional data manifold, as they primarily aim to perform dimensionality reduction into $\mathbb{R}^D$, losing key manifold features when the embedding dimension $D$ approaches $d$. On the other hand, methods that directly learn the latent manifold as a differentiable atlas have been relatively underexplored. In this paper, we aim to give a proof of concept of the effectiveness and potential of atlas-based methods. To this end, we implement a generic data structure to maintain a differentiable atlas that enables Riemannian optimization over the manifold. We complement this with an unsupervised heuristic that learns a differentiable atlas from point cloud data. We experimentally demonstrate that this approach has advantages in terms of efficiency and accuracy in selected settings. Moreover, in a supervised classification task over the Klein bottle and in RNA velocity analysis of hematopoietic data, we showcase the improved interpretability and robustness of our approach.
☆ VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
comment: 18 pages, 7 Figures,
☆ Prediction of Sea Ice Velocity and Concentration in the Arctic Ocean using Physics-informed Neural Network
As an increasing amount of remote sensing data becomes available in the Arctic Ocean, data-driven machine learning (ML) techniques are becoming widely used to predict sea ice velocity (SIV) and sea ice concentration (SIC). However, fully data-driven ML models have limitations in generalizability and physical consistency due to their excessive reliance on the quantity and quality of training data. In particular, as Arctic sea ice entered a new phase with thinner ice and accelerated melting, there is a possibility that an ML model trained with historical sea ice data cannot fully represent the dynamically changing sea ice conditions in the future. In this study, we develop physics-informed neural network (PINN) strategies to integrate physical knowledge of sea ice into the ML model. Based on the Hierarchical Information-sharing U-net (HIS-Unet) architecture, we incorporate the physics loss function and the activation function to produce physically plausible SIV and SIC outputs. Our PINN model outperforms the fully data-driven model in the daily predictions of SIV and SIC, even when trained with a small number of samples. The PINN approach particularly improves SIC predictions in melting and early freezing seasons and near fast-moving ice regions.
comment: 49 pages, 7 figures, submitted to Environmental Modelling & Software
Efficient Tensor Completion Algorithms for Highly Oscillatory Operators
This paper presents low-complexity tensor completion algorithms and their efficient implementation to reconstruct highly oscillatory operators discretized as $n\times n$ matrices. The underlying tensor decomposition is based on the reshaping of the input matrix and its butterfly decomposition into an order $\mathcal{O} (\log n)$ tensor. The reshaping of the input matrix into a tensor allows for representation of the butterfly decomposition as a tensor decomposition with dense tensors. This leads to efficient utilization of the existing software infrastructure for dense and sparse tensor computations. We propose two tensor completion algorithms in the butterfly format, using alternating least squares and gradient-based optimization, as well as a novel strategy that uses low-rank matrix completion to efficiently generate an initial guess for the proposed algorithms. To demonstrate the efficiency and applicability of our proposed algorithms, we perform three numerical experiments using simulated oscillatory operators in seismic applications. In these experiments, we use $\mathcal {O} (n \log n)$ observed entries in the input matrix and demonstrate an $\mathcal{O}(n\log^3 n)$ computational cost of the proposed algorithms, leading to a speedup of orders of magnitudes per iteration for large matrices compared to the low-rank matrix and quantized tensor-train completion. Moreover, the proposed butterfly completion algorithms, equipped with the novel initial guess generation strategy, achieve reconstruction errors that are smaller by an order of magnitude, enabling accurate recovery of the underlying structure compared to the state-of-the-art completion algorithms.
☆ Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
☆ Enabling Fine-Grained Operating Points for Black-Box LLMs
Black-box Large Language Models (LLMs) provide practical and accessible alternatives to other machine learning methods, as they require minimal labeled data and machine learning expertise to develop solutions for various decision making problems. However, for applications that need operating with constraints on specific metrics (e.g., precision $\geq$ 95%), decision making with black-box LLMs remains unfavorable, due to their low numerical output cardinalities. This results in limited control over their operating points, preventing fine-grained adjustment of their decision making behavior. In this paper, we study using black-box LLMs as classifiers, focusing on efficiently improving their operational granularity without performance loss. Specifically, we first investigate the reasons behind their low-cardinality numerical outputs and show that they are biased towards generating rounded but informative verbalized probabilities. Then, we experiment with standard prompt engineering, uncertainty estimation and confidence elicitation techniques, and observe that they do not effectively improve operational granularity without sacrificing performance or increasing inference cost. Finally, we propose efficient approaches to significantly increase the number and diversity of available operating points. Our proposed approaches provide finer-grained operating points and achieve comparable to or better performance than the benchmark methods across 11 datasets and 3 LLMs.
comment: 35 pages
☆ AcademicEval: Live Long-Context LLM Benchmark
Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
comment: Accepted by TMLR. Code is available at https://github.com/ulab-uiuc/AcademicEval
☆ The Marked Edge Walk: A Novel MCMC Algorithm for Sampling of Graph Partitions
Novel Markov Chain Monte Carlo (MCMC) methods have enabled the generation of large ensembles of redistricting plans through graph partitioning. However, existing algorithms such as Reversible Recombination (RevReCom) and Metropolized Forest Recombination (MFR) are constrained to sampling from distributions related to spanning trees. We introduce the marked edge walk (MEW), a novel MCMC algorithm for sampling from the space of graph partitions under a tunable distribution. The walk operates on the space of spanning trees with marked edges, allowing for calculable transition probabilities for use in the Metropolis-Hastings algorithm. Empirical results on real-world dual graphs show convergence under target distributions unrelated to spanning trees. For this reason, MEW represents an advancement in flexible ensemble generation.
☆ Closing the Sim2Real Performance Gap in RL
Sim2Real aims at training policies in high-fidelity simulation environments and effectively transferring them to the real world. Despite the developments of accurate simulators and Sim2Real RL approaches, the policies trained purely in simulation often suffer significant performance drops when deployed in real environments. This drop is referred to as the Sim2Real performance gap. Current Sim2Real RL methods optimize the simulator accuracy and variability as proxies for real-world performance. However, these metrics do not necessarily correlate with the real-world performance of the policy as established theoretically and empirically in the literature. We propose a novel framework to address this issue by directly adapting the simulator parameters based on real-world performance. We frame this problem as a bi-level RL framework: the inner-level RL trains a policy purely in simulation, and the outer-level RL adapts the simulation model and in-sim reward parameters to maximize real-world performance of the in-sim policy. We derive and validate in simple examples the mathematical tools needed to develop bi-level RL algorithms that close the Sim2Real performance gap.
☆ GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at https://github.com/3145tttt/GAS.
☆ A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning NeurIPS 2025
Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing mechanisms to coordinate agents most relies on empirical studies, lacking a easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce interaction paradigms that leverage MAIDs to analyze and visualize existing approaches in MARL. Then, we design a new interaction paradigm based on MAIDs, referred to as targeted intervention that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In our implementation, we introduce a causal inference technique-referred to as Pre-Strategy Intervention (PSI)-to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.
comment: Accepted to NeurIPS 2025
Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning
This dissertation makes three main contributions. First, We identify a new connection between policy gradient and dynamic programming in MMDPs and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm to compute a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvements to a local maximum. Second, We establish sufficient and necessary conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for ERM-TRC and EVaR-TRC. We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for ERM-TRC and EVaR-TRC. Third, We propose model-free Q-learning algorithms for computing policies with risk-averse objectives: ERM-TRC and EVaR-TRC. The challenge is that Q-learning ERM Bellman may not be a contraction. Instead, we use the monotonicity of Q-learning ERM Bellman operators to derive a rigorous proof that the ERM-TRC and the EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions. The proposed Q-learning algorithms compute the optimal stationary policy for ERM-TRC and EVaR-TRC.
comment: Dissertation
☆ Quantum Synthetic Data Generation for Industrial Bioprocess Monitoring
Data scarcity and sparsity in bio-manufacturing poses challenges for accurate model development, process monitoring, and optimization. We aim to replicate and capture the complex dynamics of industrial bioprocesses by proposing the use of a Quantum Wasserstein Generative Adversarial Network with Gradient Penalty (QWGAN-GP) to generate synthetic time series data for industrially relevant processes. The generator within our GAN is comprised of a Parameterized Quantum Circuit (PQC). This methodology offers potential advantages in process monitoring, modeling, forecasting, and optimization, enabling more efficient bioprocess management by reducing the dependence on scarce experimental data. Our results demonstrate acceptable performance in capturing the temporal dynamics of real bioprocess data. We focus on Optical Density, a key measurement for Dry Biomass estimation. The data generated showed high fidelity to the actual historical experimental data. This intersection of quantum computing and machine learning has opened new frontiers in data analysis and generation, particularly in computationally intensive fields, for use cases such as increasing prediction accuracy for soft sensor design or for use in predictive control.
☆ LILO: Bayesian Optimization with Interactive Natural Language Feedback
For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives. We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured feedback in the form of natural language into scalar utilities to conduct BO over a numeric search space. Unlike preferential BO, which only accepts restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to easily include flexible user priors without manual kernel design. At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.
☆ On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery annotation.The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less then a minute, which is significantly faster than state-of-the-art alternatives.Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
☆ Handling Extreme Class Imbalance: Using GANs in Data Augmentation for Suicide Prediction
Suicide prediction is the key for prevention, but real data with sufficient positive samples is rare and causes extreme class imbalance. We utilized machine learning (ML) to build the model and deep learning (DL) techniques, like Generative Adversarial Networks (GAN), to generate synthetic data samples to enhance the dataset. The initial dataset contained 656 samples, with only four positive cases, prompting the need for data augmentation. A variety of machine learning models, ranging from interpretable data models to black box algorithmic models, were used. On real test data, Logistic Regression (LR) achieved a weighted precision of 0.99, a weighted recall of 0.85, and a weighted F1 score of 0.91; Random Forest (RF) showed 0.98, 0.99, and 0.99, respectively; and Support Vector Machine (SVM) achieved 0.99, 0.76, and 0.86. LR and SVM correctly identified one suicide attempt case (sensitivity:1.0) and misclassified LR(20) and SVM (31) non-attempts as attempts (specificity: 0.85 & 0.76, respectively). RF identified 0 suicide attempt cases (sensitivity: 0.0) with 0 false positives (specificity: 1.0). These results highlight the models' effectiveness, with GAN playing a key role in generating synthetic data to support suicide prevention modeling efforts.
☆ Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
comment: 7 pages, 1 figure, FLTA 2025
☆ ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
comment: 14 pages, 6 figures, 2 tables. Primary subject: cs.LG (Machine Learning) Cross-listed to: cs.CV (Computer Vision and Pattern Recognition), eess.IV (Image and Video Processing). Code available at: https://github.com/Bluesman79/ZACH-ViT Installation: pip install zachvit Paper licensed under CC BY-NC-ND 4.0. Code released under Apache 2.0 License
☆ Quantum Federated Learning: Architectural Elements and Future Directions
Federated learning (FL) focuses on collaborative model training without the need to move the private data silos to a central server. Despite its several benefits, the classical FL is plagued with several limitations, such as high computational power required for model training(which is critical for low-resource clients), privacy risks, large update traffic, and non-IID heterogeneity. This chapter surveys a hybrid paradigm - Quantum Federated Learning (QFL), which introduces quantum computation, that addresses multiple challenges of classical FL and offers rapid computing capability while keeping the classical orchestration intact. Firstly, we motivate QFL with a concrete presentation on pain points of classical FL, followed by a discussion on a general architecture of QFL frameworks specifying the roles of client and server, communication primitives and the quantum model placement. We classify the existing QFL systems based on four criteria - quantum architecture (pure QFL, hybrid QFL), data processing method (quantum data encoding, quantum feature mapping, and quantum feature selection & dimensionality reduction), network topology (centralized, hierarchial, decentralized), and quantum security mechanisms (quantum key distribution, quantum homomorphic encryption, quantum differential privacy, blind quantum computing). We then describe applications of QFL in healthcare, vehicular networks, wireless networks, and network security, clearly highlighting where QFL improves communication efficiency, security, and performance compared to classical FL. We close with multiple challenges and future works in QFL, including extension of QFL beyond classification tasks, adversarial attacks, realistic hardware deployment, quantum communication protocols deployment, aggregation of different quantum models, and quantum split learning as an alternative to QFL.
comment: 28 PAGES, 11 figures, introductory review article (book chapter), to be published in a book with springer
☆ RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation ICRA2026
Vision-Language-Action models (VLAs) have demonstrated remarkable performance on complex robotic manipulation tasks through imitation learning. However, existing imitation learning datasets contain only successful trajectories and lack failure or recovery data, especially for out-of-distribution (OOD) states where the robot deviates from the main policy due to minor perturbations or errors, leading VLA models to struggle with states deviating from the training distribution. To this end, we propose an automated OOD data augmentation framework named RESample through exploratory sampling. Specifically, we first leverage offline reinforcement learning to obtain an action-value network that accurately identifies sub-optimal actions under the current manipulation policy. We further sample potential OOD states from trajectories via rollout, and design an exploratory sampling mechanism that adaptively incorporates these action proxies into the training dataset to ensure efficiency. Subsequently, our framework explicitly encourages the VLAs to recover from OOD states and enhances their robustness against distributional shifts. We conduct extensive experiments on the LIBERO benchmark as well as real-world robotic manipulation tasks, demonstrating that RESample consistently improves the stability and generalization ability of VLA models.
comment: 9 pages,7 figures, submitted to ICRA2026
☆ LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena
Forecasting is not only a fundamental intellectual pursuit but also is of significant importance to societal systems such as finance and economics. With the rapid advances of large language models (LLMs) trained on Internet-scale data, it raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears.
comment: https://www.prophetarena.co/
☆ Just-In-Time Piecewise-Linear Semantics for ReLU-type Networks
We present a JIT PL semantics for ReLU-type networks that compiles models into a guarded CPWL transducer with shared guards. The system adds hyperplanes only when operands are affine on the current cell, maintains global lower/upper envelopes, and uses a budgeted branch-and-bound. We obtain anytime soundness, exactness on fully refined cells, monotone progress, guard-linear complexity (avoiding global $\binom{k}{2}$), dominance pruning, and decidability under finite refinement. The shared carrier supports region extraction, decision complexes, Jacobians, exact/certified Lipschitz, LP/SOCP robustness, and maximal causal influence. A minimal prototype returns certificates or counterexamples with cost proportional to visited subdomains.
☆ Non-asymptotic error bounds for probability flow ODEs under weak log-concavity
Score-based generative modeling, implemented through probability flow ODEs, has shown impressive results in numerous practical settings. However, most convergence guarantees rely on restrictive regularity assumptions on the target distribution -- such as strong log-concavity or bounded support. This work establishes non-asymptotic convergence bounds in the 2-Wasserstein distance for a general class of probability flow ODEs under considerably weaker assumptions: weak log-concavity and Lipschitz continuity of the score function. Our framework accommodates non-log-concave distributions, such as Gaussian mixtures, and explicitly accounts for initialization errors, score approximation errors, and effects of discretization via an exponential integrator scheme. Bridging a key theoretical challenge in diffusion-based generative modeling, our results extend convergence theory to more realistic data distributions and practical ODE solvers. We provide concrete guarantees for the efficiency and correctness of the sampling algorithm, complementing the empirical success of diffusion models with rigorous theory. Moreover, from a practical perspective, our explicit rates might be helpful in choosing hyperparameters, such as the step size in the discretization.
☆ Reasoning Distillation and Structural Alignment for Improved Code Generation
Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. very large language model (VLLM) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple to implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.
☆ HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection
Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e. abstract syntax tree family correlation, lexical correlation, and line correlation. We design a tokens and hyperedges generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter can encode high-order data correlations and is allowed to be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, including six languages of code summarization and code clone detection tasks. Our methods improved the performance of PLMs in datasets to varying degrees. Experimental results validate the introduction of high-order data correlations that contribute to improved effectiveness.
comment: Accepted by the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) as a findings long paper
☆ MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.
comment: 16 pages, 3 tables, 1 figure
☆ CEPerFed: Communication-Efficient Personalized Federated Learning for Multi-Pulse MRI Classification
Multi-pulse magnetic resonance imaging (MRI) is widely utilized for clinical practice such as Alzheimer's disease diagnosis. To train a robust model for multi-pulse MRI classification, it requires large and diverse data from various medical institutions while protecting privacy by preventing raw data sharing across institutions. Although federated learning (FL) is a feasible solution to address this issue, it poses challenges of model convergence due to the effect of data heterogeneity and substantial communication overhead due to large numbers of parameters transmitted within the model. To address these challenges, we propose CEPerFed, a communication-efficient personalized FL method. It mitigates the effect of data heterogeneity by incorporating client-side historical risk gradients and historical mean gradients to coordinate local and global optimization. The former is used to weight the contributions from other clients, enhancing the reliability of local updates, while the latter enforces consistency between local updates and the global optimization direction to ensure stable convergence across heterogeneous data distributions. To address the high communication overhead, we propose a hierarchical SVD (HSVD) strategy that transmits only the most critical information required for model updates. Experiments on five classification tasks demonstrate the effectiveness of the CEPerFed method. The code will be released upon acceptance at https://github.com/LD0416/CEPerFed.
☆ Semi-supervised Latent Bayesian Optimization for Designing Antimicrobial Peptides
Antimicrobial peptides (AMPs) are a promising class of therapeutics to treat bacterial infections. Discovering and designing such peptides is difficult because of the vast number of possible sequences of amino acids. Deep generative models, such as variational autoencoders, have shown value in peptide design due to their ability to model sequence space with a continuous-valued latent space. Although such models have already been used to great effect in biomolecular design, they still suffer from a lack of interpretability and rigorous quantification of latent space quality as a search space. We investigate (1) whether further compression of the design space via dimensionality reduction may facilitate optimization, (2) the interpretability of the spaces, and (3) how organizing latent spaces with physicochemical properties may improve the efficiency of optimizing antimicrobial activity. We find that further reduction of the latent space via dimensionality reduction can be advantageous when organizing the space with more relevant information at data availability, that using the dimensionality reduction search space can be more interpretable, and that we can organize the latent space with different physicochemical properties even at different percentages of available labels.
comment: 19 pages, 9 figures
☆ An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning
In safety-critical domains such as robotics, navigation and power systems, constrained optimization problems arise where maximizing performance must be carefully balanced with associated constraints. Safe reinforcement learning provides a framework to address these challenges, with Lagrangian methods being a popular choice. However, the effectiveness of Lagrangian methods crucially depends on the choice of the Lagrange multiplier $\lambda$, which governs the trade-off between return and constraint cost. A common approach is to update the multiplier automatically during training. Although this is standard in practice, there remains limited empirical evidence on the robustness of an automated update and its influence on overall performance. Therefore, we analyze (i) optimality and (ii) stability of Lagrange multipliers in safe reinforcement learning across a range of tasks. We provide $\lambda$-profiles that give a complete visualization of the trade-off between return and constraint cost of the optimization problem. These profiles show the highly sensitive nature of $\lambda$ and moreover confirm the lack of general intuition for choosing the optimal value $\lambda^*$. Our findings additionally show that automated multiplier updates are able to recover and sometimes even exceed the optimal performance found at $\lambda^*$ due to the vast difference in their learning trajectories. Furthermore, we show that automated multiplier updates exhibit oscillatory behavior during training, which can be mitigated through PID-controlled updates. However, this method requires careful tuning to achieve consistently better performance across tasks. This highlights the need for further research on stabilizing Lagrangian methods in safe reinforcement learning. The code used to reproduce our results can be found at https://github.com/lindsayspoor/Lagrangian_SafeRL.
☆ Formally Exploring Time-Series Anomaly Detection Evaluation Metrics
Undetected anomalies in time series can trigger catastrophic failures in safety-critical systems, such as chemical plant explosions or power grid outages. Although many detection methods have been proposed, their performance remains unclear because current metrics capture only narrow aspects of the task and often yield misleading results. We address this issue by introducing verifiable properties that formalize essential requirements for evaluating time-series anomaly detection. These properties enable a theoretical framework that supports principled evaluations and reliable comparisons. Analyzing 37 widely used metrics, we show that most satisfy only a few properties, and none satisfy all, explaining persistent inconsistencies in prior results. To close this gap, we propose LARM, a flexible metric that provably satisfies all properties, and extend it to ALARM, an advanced variant meeting stricter requirements.
comment: 73 pages, 13 figures
☆ The Free Transformer
We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.
☆ TrajMamba: An Efficient and Semantic-rich Vehicle Trajectory Pre-training Model NeurIPS2025
Vehicle GPS trajectories record how vehicles move over time, storing valuable travel semantics, including movement patterns and travel purposes. Learning travel semantics effectively and efficiently is crucial for real-world applications of trajectory data, which is hindered by two major challenges. First, travel purposes are tied to the functions of the roads and points-of-interest (POIs) involved in a trip. Such information is encoded in textual addresses and descriptions and introduces heavy computational burden to modeling. Second, real-world trajectories often contain redundant points, which harm both computational efficiency and trajectory embedding quality. To address these challenges, we propose TrajMamba, a novel approach for efficient and semantically rich vehicle trajectory learning. TrajMamba introduces a Traj-Mamba Encoder that captures movement patterns by jointly modeling both GPS and road perspectives of trajectories, enabling robust representations of continuous travel behaviors. It also incorporates a Travel Purpose-aware Pre-training procedure to integrate travel purposes into the learned embeddings without introducing extra overhead to embedding calculation. To reduce redundancy in trajectories, TrajMamba features a Knowledge Distillation Pre-training scheme to identify key trajectory points through a learnable mask generator and obtain effective compressed trajectory embeddings. Extensive experiments on two real-world datasets and three downstream tasks show that TrajMamba outperforms state-of-the-art baselines in both efficiency and accuracy.
comment: Accepted by NeurIPS2025
☆ Reliable Inference in Edge-Cloud Model Cascades via Conformal Alignment
Edge intelligence enables low-latency inference via compact on-device models, but assuring reliability remains challenging. We study edge-cloud cascades that must preserve conditional coverage: whenever the edge returns a prediction set, it should contain the true label with a user-specified probability, as if produced by the cloud model. We formalize conditional coverage with respect to the cloud predictive distribution, and introduce a conformal alignment-based (CAb) cascading mechanism that certifies this property with user control over the risk level. Our method casts escalation from edge to cloud models as a multiple-hypothesis testing (MHT) problem, tailoring conformal alignment (CA) to select which inputs can be safely handled at the edge. The proposed CAb model cascading method yields statistical guarantees on the average fraction of edge decisions that satisfy cloud-level conditional coverage. The procedure applies to arbitrary edge prediction sets, including variants of conformal prediction (CP), and exposes a tunable trade-off among coverage, deferral rate, and set size. Experiments on CIFAR-100 image classification and the TeleQnA question-answering (QA) benchmark show that the proposed CAb cascade maintains the target conditional coverage for edge predictions while substantially reducing offloading to the cloud and incurring modest increases in prediction-set size.
comment: Under Review
☆ OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction
Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.
☆ Plasma Shape Control via Zero-shot Generative Reinforcement Learning
Traditional PID controllers have limited adaptability for plasma shape control, and task-specific reinforcement learning (RL) methods suffer from limited generalization and the need for repetitive retraining. To overcome these challenges, this paper proposes a novel framework for developing a versatile, zero-shot control policy from a large-scale offline dataset of historical PID-controlled discharges. Our approach synergistically combines Generative Adversarial Imitation Learning (GAIL) with Hilbert space representation learning to achieve dual objectives: mimicking the stable operational style of the PID data and constructing a geometrically structured latent space for efficient, goal-directed control. The resulting foundation policy can be deployed for diverse trajectory tracking tasks in a zero-shot manner without any task-specific fine-tuning. Evaluations on the HL-3 tokamak simulator demonstrate that the policy excels at precisely and stably tracking reference trajectories for key shape parameters across a range of plasma scenarios. This work presents a viable pathway toward developing highly flexible and data-efficient intelligent control systems for future fusion reactors.
☆ MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation
Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delination. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.
☆ How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime?
The capacity of deep learning models is often large enough to both learn the underlying statistical signal and overfit to noise in the training set. This noise memorization can be harmful especially for data with a low signal-to-noise ratio (SNR), leading to poor generalization. Inspired by prior observations that label noise provides implicit regularization that improves generalization, in this work, we investigate whether introducing label noise to the gradient updates can enhance the test performance of neural network (NN) in the low SNR regime. Specifically, we consider training a two-layer NN with a simple label noise gradient descent (GD) algorithm, in an idealized signal-noise data setting. We prove that adding label noise during training suppresses noise memorization, preventing it from dominating the learning process; consequently, label noise GD enjoys rapid signal growth while the overfitting remains controlled, thereby achieving good generalization despite the low SNR. In contrast, we also show that NN trained with standard GD tends to overfit to noise in the same low SNR setting and establish a non-vanishing lower bound on its test error, thus demonstrating the benefit of introducing label noise in gradient-based training.
comment: 40 pages
☆ Mitigating Clever Hans Strategies in Image Classifiers through Generating Counterexamples
Deep learning models remain vulnerable to spurious correlations, leading to so-called Clever Hans predictors that undermine robustness even in large-scale foundation and self-supervised models. Group distributional robustness methods, such as Deep Feature Reweighting (DFR) rely on explicit group labels to upweight underrepresented subgroups, but face key limitations: (1) group labels are often unavailable, (2) low within-group sample sizes hinder coverage of the subgroup distribution, and (3) performance degrades sharply when multiple spurious correlations fragment the data into even smaller groups. We propose Counterfactual Knowledge Distillation (CFKD), a framework that sidesteps these issues by generating diverse counterfactuals, enabling a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step. Unlike DFR, our method not only reweights the undersampled groups, but it also enriches them with new data points. Our method does not require any confounder labels, achieves effective scaling to multiple confounders, and yields balanced generalization across groups. We demonstrate CFKD's efficacy across five datasets, spanning synthetic tasks to an industrial application, with particularly strong gains in low-data regimes with pronounced spurious correlations. Additionally, we provide an ablation study on the effect of the chosen counterfactual explainer and teacher model, highlighting their impact on robustness.
☆ Curiosity Meets Cooperation: A Game-Theoretic Approach to Long-Tail Multi-Label Learning
Long-tail imbalance is endemic to multi-label learning: a few head labels dominate the gradient signal, while the many rare labels that matter in practice are silently ignored. We tackle this problem by casting the task as a cooperative potential game. In our Curiosity-Driven Game-Theoretic Multi-Label Learning (CD-GTMLL) framework, the label space is split among several cooperating players that share a global accuracy payoff yet earn additional curiosity rewards that rise with label rarity and inter-player disagreement. These curiosity bonuses inject gradient on under-represented tags without hand-tuned class weights. We prove that gradient best-response updates ascend a differentiable potential and converge to tail-aware stationary points that tighten a lower bound on the expected Rare-F1. Extensive experiments on conventional benchmarks and three extreme-scale datasets show consistent state-of-the-art gains, delivering up to +4.3% Rare-F1 and +1.6% P@3 over the strongest baselines, while ablations reveal emergent division of labour and faster consensus on rare classes. CD-GTMLL thus offers a principled, scalable route to long-tail robustness in multi-label prediction.
comment: Under review
☆ SAFE-D: A Spatiotemporal Detection Framework for Abnormal Driving Among Parkinson's Disease-like Drivers
A driver's health state serves as a determinant factor in driving behavioral regulation. Subtle deviations from normalcy can lead to operational anomalies, posing risks to public transportation safety. While prior efforts have developed detection mechanisms for functionally-driven temporary anomalies such as drowsiness and distraction, limited research has addressed pathologically-triggered deviations, especially those stemming from chronic medical conditions. To bridge this gap, we investigate the driving behavior of Parkinson's disease patients and propose SAFE-D, a novel framework for detecting Parkinson-related behavioral anomalies to enhance driving safety. Our methodology starts by performing analysis of Parkinson's disease symptomatology, focusing on primary motor impairments, and establishes causal links to degraded driving performance. To represent the subclinical behavioral variations of early-stage Parkinson's disease, our framework integrates data from multiple vehicle control components to build a behavioral profile. We then design an attention-based network that adaptively prioritizes spatiotemporal features, enabling robust anomaly detection under physiological variability. Finally, we validate SAFE-D on the Logitech G29 platform and CARLA simulator, using data from three road maps to emulate real-world driving. Our results show SAFE-D achieves 96.8% average accuracy in distinguishing normal and Parkinson-affected driving patterns.
☆ SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
comment: Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench
☆ The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis NeurIPS 2025
Sparse neural networks promise efficiency, yet training them effectively remains a fundamental challenge. Despite advances in pruning methods that create sparse architectures, understanding why some sparse structures are better trainable than others with the same level of sparsity remains poorly understood. Aiming to develop a systematic approach to this fundamental problem, we propose a novel theoretical framework based on the theory of graph limits, particularly graphons, that characterizes sparse neural networks in the infinite-width regime. Our key insight is that connectivity patterns of sparse neural networks induced by pruning methods converge to specific graphons as networks' width tends to infinity, which encodes implicit structural biases of different pruning methods. We postulate the Graphon Limit Hypothesis and provide empirical evidence to support it. Leveraging this graphon representation, we derive a Graphon Neural Tangent Kernel (Graphon NTK) to study the training dynamics of sparse networks in the infinite width limit. Graphon NTK provides a general framework for the theoretical analysis of sparse networks. We empirically show that the spectral analysis of Graphon NTK correlates with observed training dynamics of sparse networks, explaining the varying convergence behaviours of different pruning methods. Our framework provides theoretical insights into the impact of connectivity patterns on the trainability of various sparse network architectures.
comment: NeurIPS 2025 Spotlight
☆ AWARE: Audio Watermarking with Adversarial Resistance to Edits
Prevailing practice in learning-based audio watermarking is to pursue robustness by expanding the set of simulated distortions during training. However, such surrogates are narrow and prone to overfitting. This paper presents AWARE (Audio Watermarking with Adversarial Resistance to Edits), an alternative approach that avoids reliance on attack-simulation stacks and handcrafted differentiable distortions. Embedding is obtained via adversarial optimization in the time-frequency domain under a level-proportional perceptual budget. Detection employs a time-order-agnostic detector with a Bitwise Readout Head (BRH) that aggregates temporal evidence into one score per watermark bit, enabling reliable watermark decoding even under desynchronization and temporal cuts. Empirically, AWARE attains high audio quality and speech intelligibility (PESQ/STOI) and consistently low BER across various audio edits, often surpassing representative state-of-the-art learning-based audio watermarking systems.
☆ Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares NeurIPS2025
Classical optimisation theory guarantees monotonic objective decrease for gradient descent (GD) when employed in a small step size, or ``stable", regime. In contrast, gradient descent on neural networks is frequently performed in a large step size regime called the ``edge of stability", in which the objective decreases non-monotonically with an observed implicit bias towards flat minima. In this paper, we take a step toward quantifying this phenomenon by providing convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind our analysis is that, as a consequence of overparametrisation, the set of global minimisers forms a Riemannian manifold $M$, which enables the decomposition of the GD dynamics into components parallel and orthogonal to $M$. The parallel component corresponds to Riemannian gradient descent on the objective sharpness, while the orthogonal component is a bifurcating dynamical system. This insight allows us to derive convergence rates in three regimes characterised by the learning rate size: (a) the subcritical regime, in which transient instability is overcome in finite time before linear convergence to a suboptimally flat global minimum; (b) the critical regime, in which instability persists for all time with a power-law convergence toward the optimally flat global minimum; and (c) the supercritical regime, in which instability persists for all time with linear convergence to an orbit of period two centred on the optimally flat global minimum.
comment: NeurIPS2025. Code available at https://github.com/lemacdonald/eos-convergence-rates-codimension-1
☆ Stochastic Difference-of-Convex Optimization with Momentum
Stochastic difference-of-convex (DC) optimization is prevalent in numerous machine learning applications, yet its convergence properties under small batch sizes remain poorly understood. Existing methods typically require large batches or strong noise assumptions, which limit their practical use. In this work, we show that momentum enables convergence under standard smoothness and bounded variance assumptions (of the concave part) for any batch size. We prove that without momentum, convergence may fail regardless of stepsize, highlighting its necessity. Our momentum-based algorithm achieves provable convergence and demonstrates strong empirical performance.
☆ I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models NeurIPS 2025
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
comment: Accepted at the 5th Workshop on Mathematical Reasoning and AI (MATH-AI), NeurIPS 2025
☆ DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning NeurIPS 2025
Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at https://github.com/heyongxin233/DETree.
comment: To appear in NeurIPS 2025
Local properties of neural networks through the lens of layer-wise Hessians
We introduce a methodology for analyzing neural networks through the lens of layer-wise Hessian matrices. The local Hessian of each functional block (layer) is defined as the matrix of second derivatives of a scalar function with respect to the parameters of that layer. This concept provides a formal tool for characterizing the local geometry of the parameter space. We show that the spectral properties of local Hessians, such as the distribution of eigenvalues, reveal quantitative patterns associated with overfitting, underparameterization, and expressivity in neural network architectures. We conduct an extensive empirical study involving 111 experiments across 37 datasets. The results demonstrate consistent structural regularities in the evolution of local Hessians during training and highlight correlations between their spectra and generalization performance. These findings establish a foundation for using local geometric analysis to guide the diagnosis and design of deep neural networks. The proposed framework connects optimization geometry with functional behavior and offers practical insight for improving network architectures and training stability.
comment: Comments: 22 pages, 8 figures. Submitted to arXiv:cs.LG
☆ Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization
Decentralized Learning (DL) enables users to collaboratively train models without sharing raw data by iteratively averaging local updates with neighbors in a network graph. This setting is increasingly popular for its scalability and its ability to keep data local under user control. Strong privacy guarantees in DL are typically achieved through Differential Privacy (DP), with results showing that DL can even amplify privacy by disseminating noise across peer-to-peer communications. Yet in practice, the observed privacy-utility trade-off often appears worse than in centralized training, which may be due to limitations in current DP accounting methods for DL. In this paper, we show that recent advances in centralized DP accounting based on Matrix Factorization (MF) for analyzing temporal noise correlations can also be leveraged in DL. By generalizing existing MF results, we show how to cast both standard DL algorithms and common trust models into a unified formulation. This yields tighter privacy accounting for existing DP-DL algorithms and provides a principled way to develop new ones. To demonstrate the approach, we introduce MAFALDA-SGD, a gossip-based DL algorithm with user-level correlated noise that outperforms existing methods on synthetic and real-world graphs.
comment: 21 pages, 5 figures
☆ Towards geological inference with process-based and deep generative modeling, part 2: inversion of fluvial deposits and latent-space disentanglement
High costs and uncertainties make subsurface decision-making challenging, as acquiring new data is rarely scalable. Embedding geological knowledge directly into predictive models offers a valuable alternative. A joint approach enables just that: process-based models that mimic geological processes can help train generative models that make predictions more efficiently. This study explores whether a generative adversarial network (GAN) - a type of deep-learning algorithm for generative modeling - trained to produce fluvial deposits can be inverted to match well and seismic data. Four inversion approaches applied to three test samples with 4, 8, and 20 wells struggled to match these well data, especially as the well number increased or as the test sample diverged from the training data. The key bottleneck lies in the GAN's latent representation: it is entangled, so samples with similar sedimentological features are not necessarily close in the latent space. Label conditioning or latent overparameterization can partially disentangle the latent space during training, although not yet sufficiently for a successful inversion. Fine-tuning the GAN to restructure the latent space locally reduces mismatches to acceptable levels for all test cases, with and without seismic data. But this approach depends on an initial, partially successful inversion step, which influences the quality and diversity of the final samples. Overall, GANs can already handle the tasks required for their integration into geomodeling workflows. We still need to further assess their robustness, and how to best leverage them in support of geological interpretation.
comment: 52 pages, 42 figures
☆ DAMSDAN: Distribution-Aware Multi-Source Domain Adaptation Network for Cross-Domain EEG-based Emotion Recognition
Significant inter-individual variability limits the generalization of EEG-based emotion recognition under cross-domain settings. We address two core challenges in multi-source adaptation: (1) dynamically modeling distributional heterogeneity across sources and quantifying their relevance to a target to reduce negative transfer; and (2) achieving fine-grained semantic consistency to strengthen class discrimination. We propose a distribution-aware multi-source domain adaptation network (DAMSDAN). DAMSDAN integrates prototype-based constraints with adversarial learning to drive the encoder toward discriminative, domain-invariant emotion representations. A domain-aware source weighting strategy based on maximum mean discrepancy (MMD) dynamically estimates inter-domain shifts and reweights source contributions. In addition, a prototype-guided conditional alignment module with dual pseudo-label interaction enhances pseudo-label reliability and enables category-level, fine-grained alignment, mitigating noise propagation and semantic drift. Experiments on SEED and SEED-IV show average accuracies of 94.86\% and 79.78\% for cross-subject, and 95.12\% and 83.15\% for cross-session protocols. On the large-scale FACED dataset, DAMSDAN achieves 82.88\% (cross-subject). Extensive ablations and interpretability analyses corroborate the effectiveness of the proposed framework for cross-domain EEG-based emotion recognition.
comment: 14 pages, 9 figures
☆ Certified Self-Consistency: Statistical Guarantees and Test-Time Training for Reliable Reasoning in LLMs
Recent advances such as self-consistency and test-time reinforcement learning (TTRL) improve the reliability of large language models (LLMs) without additional supervision, yet their underlying mechanisms and statistical guarantees remain poorly understood. We present a unified framework for certifiable inference in LLMs, showing that majority voting provides a statistical certificate of self-consistency: under mild assumptions, the aggregated answer coincides with the mode of the model's terminal distribution with high probability. We derive finite-sample and anytime-valid concentration bounds that quantify this confidence, and introduce the Martingale Majority Certificate (MMC), a sequential stopping rule that adaptively determines when sufficient samples have been drawn. We further prove that label-free post-training methods such as TTRL implicitly sharpen the answer distribution by exponentially tilting it toward its mode, thereby reducing the number of samples required for certification. Building on this insight, we propose new post-training objectives that explicitly optimise this trade-off between sharpness and bias. Together, these results explain and connect two central test-time scaling strategies, self-consistency and TTRL, within a single statistical framework for label-free, certifiable reliability in reasoning LLMs.
☆ Layer Specialization Underlying Compositional Reasoning in Transformers
Transformers exhibit compositional reasoning on sequences not observed during training, a capability often attributed to in-context learning (ICL) and skill composition. We investigate this phenomenon using the Random Hierarchy Model (RHM), a probabilistic context-free grammar that generates sequences through recursive rule application. Models are trained on subsets of sequences and evaluated across four generalization conditions: memorization, in-distribution generalization, out-of-distribution generalization with the same rules, and cross-layer transfer. Behaviorally, performance improves systematically with task complexity and the number of in-context examples, with out-of-distribution tasks requiring substantially more examples than in-distribution scenarios. Mechanistically, we identify a progressive emergence of layer specialization during training that correlates with generalization performance. Principal component analysis and attention pattern clustering reveal that transformers develop structured, hierarchically organized representations in specialized layers. These results demonstrate that transformers develop modular, interpretable mechanisms supporting compositional reasoning, linking internal algorithmic structure to observed behavioral capabilities.
☆ CrossStateECG: Multi-Scale Deep Convolutional Network with Attention for Rest-Exercise ECG Biometrics
Current research in Electrocardiogram (ECG) biometrics mainly emphasizes resting-state conditions, leaving the performance decline in rest-exercise scenarios largely unresolved. This paper introduces CrossStateECG, a robust ECG-based authentication model explicitly tailored for cross-state (rest-exercise) conditions. The proposed model creatively combines multi-scale deep convolutional feature extraction with attention mechanisms to ensure strong identification across different physiological states. Experimental results on the exercise-ECGID dataset validate the effectiveness of CrossStateECG, achieving an identification accuracy of 92.50% in the Rest-to-Exercise scenario (training on resting ECG and testing on post-exercise ECG) and 94.72% in the Exercise-to-Rest scenario (training on post-exercise ECG and testing on resting ECG). Furthermore, CrossStateECG demonstrates exceptional performance across both state combinations, reaching an accuracy of 99.94% in Rest-to-Rest scenarios and 97.85% in Mixed-to-Mixed scenarios. Additional validations on the ECG-ID and MIT-BIH datasets further confirmed the generalization abilities of CrossStateECG, underscoring its potential as a practical solution for post-exercise ECG-based authentication in dynamic real-world settings.
☆ Estimating Orbital Parameters of Direct Imaging Exoplanet Using Neural Network
In this work, we propose a new flow-matching Markov chain Monte Carlo (FM-MCMC) algorithm for estimating the orbital parameters of exoplanetary systems, especially for those only one exoplanet is involved. Compared to traditional methods that rely on random sampling within the Bayesian framework, our approach first leverages flow matching posterior estimation (FMPE) to efficiently constrain the prior range of physical parameters, and then employs MCMC to accurately infer the posterior distribution. For example, in the orbital parameter inference of beta Pictoris b, our model achieved a substantial speed-up while maintaining comparable accuracy-running 77.8 times faster than Parallel Tempered MCMC (PTMCMC) and 365.4 times faster than nested sampling. Moreover, our FM-MCMC method also attained the highest average log-likelihood among all approaches, demonstrating its superior sampling efficiency and accuracy. This highlights the scalability and efficiency of our approach, making it well-suited for processing the massive datasets expected from future exoplanet surveys. Beyond astrophysics, our methodology establishes a versatile paradigm for synergizing deep generative models with traditional sampling, which can be adopted to tackle complex inference problems in other fields, such as cosmology, biomedical imaging, and particle physics.
☆ Explainable AI for microseismic event detection
Deep neural networks like PhaseNet show high accuracy in detecting microseismic events, but their black-box nature is a concern in critical applications. We apply explainable AI (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM) and Shapley Additive Explanations (SHAP), to interpret the PhaseNet model's decisions and improve its reliability. Grad-CAM highlights that the network's attention aligns with P- and S-wave arrivals. SHAP values quantify feature contributions, confirming that vertical-component amplitudes drive P-phase picks while horizontal components dominate S-phase picks, consistent with geophysical principles. Leveraging these insights, we introduce a SHAP-gated inference scheme that combines the model's output with an explanation-based metric to reduce errors. On a test set of 9,000 waveforms, the SHAP-gated model achieved an F1-score of 0.98 (precision 0.99, recall 0.97), outperforming the baseline PhaseNet (F1-score 0.97) and demonstrating enhanced robustness to noise. These results show that XAI can not only interpret deep learning models but also directly enhance their performance, providing a template for building trust in automated seismic detectors.
comment: Submitted to Artificial Intelligence in Geosciences
☆ Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models NeurIPS 25
Message Passing Neural Networks (MPNNs) is the building block of graph foundation models, but fundamentally suffer from oversmoothing and oversquashing. There has recently been a surge of interest in fixing both issues. Existing efforts primarily adopt global approaches, which may be beneficial in some regions but detrimental in others, ultimately leading to the suboptimal expressiveness. In this paper, we begin by revisiting oversquashing through a global measure -- spectral gap $\lambda$ -- and prove that the increase of $\lambda$ leads to gradient vanishing with respect to the input features, thereby undermining the effectiveness of message passing. Motivated by such theoretical insights, we propose a \textbf{local} approach that adaptively adjusts message passing based on local structures. To achieve this, we connect local Riemannian geometry with MPNNs, and establish a novel nonhomogeneous boundary condition to address both oversquashing and oversmoothing. Building on the Robin condition, we design a GBN network with local bottleneck adjustment, coupled with theoretical guarantees. Extensive experiments on homophilic and heterophilic graphs show the expressiveness of GBN. Furthermore, GBN does not exhibit performance degradation even when the network depth exceeds $256$ layers.
comment: Accept by NeurIPS 25
☆ The Parameterized Complexity of Computing the VC-Dimension NeurIPS 2025
The VC-dimension is a fundamental and well-studied measure of the complexity of a set system (or hypergraph) that is central to many areas of machine learning. We establish several new results on the complexity of computing the VC-dimension. In particular, given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$, we prove that the naive $2^{\mathcal{O}(|\mathcal{V}|)}$-time algorithm is asymptotically tight under the Exponential Time Hypothesis (ETH). We then prove that the problem admits a 1-additive fixed-parameter approximation algorithm when parameterized by the maximum degree of $\mathcal{H}$ and a fixed-parameter algorithm when parameterized by its dimension, and that these are essentially the only such exploitable structural parameters. Lastly, we consider a generalization of the problem, formulated using graphs, which captures the VC-dimension of both set systems and graphs. We show that it is fixed-parameter tractable parameterized by the treewidth of the graph (which, in the case of set systems, applies to the treewidth of its incidence graph). In contrast with closely related problems whose dependency on the treewidth is necessarily double-exponential (assuming the ETH), our algorithm has a relatively low dependency on the treewidth.
comment: To appear in the proceedings of NeurIPS 2025
☆ From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth, or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.
comment: Project page: https://falcon-vla.github.io/
☆ Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
Quantifying Climate Policy Action and Its Links to Development Outcomes: A Cross-National Data-Driven Analysis NeurIPS 2025
Addressing climate change effectively requires more than cataloguing the number of policies in place; it calls for tools that can reveal their thematic priorities and their tangible impacts on development outcomes. Existing assessments often rely on qualitative descriptions or composite indices, which can mask crucial differences between key domains such as mitigation, adaptation, disaster risk management, and loss and damage. To bridge this gap, we develop a quantitative indicator of climate policy orientation by applying a multilingual transformer-based language model to official national policy documents, achieving a classification accuracy of 0.90 (F1-score). Linking these indicators with World Bank development data in panel regressions reveals that mitigation policies are associated with higher GDP and GNI; disaster risk management correlates with greater GNI and debt but reduced foreign direct investment; adaptation and loss and damage show limited measurable effects. This integrated NLP-econometric framework enables comparable, theme-specific analysis of climate governance, offering a scalable method to monitor progress, evaluate trade-offs, and align policy emphasis with development goals.
comment: This paper/proposal has been accepted as a poster in the NeurIPS 2025
Diffusion Models as Dataset Distillation Priors
Dataset distillation aims to synthesize compact yet informative datasets from large ones. A significant challenge in this field is achieving a trifecta of diversity, generalization, and representativeness in a single distilled dataset. Although recent generative dataset distillation methods adopt powerful diffusion models as their foundation models, the inherent representativeness prior in diffusion models is overlooked. Consequently, these approaches often necessitate the integration of external constraints to enhance data quality. To address this, we propose Diffusion As Priors (DAP), which formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel. We then introduce this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. Extensive experiments on large-scale datasets, such as ImageNet-1K and its subsets, demonstrate that DAP outperforms state-of-the-art methods in generating high-fidelity datasets while achieving superior cross-architecture generalization. Our work not only establishes a theoretical connection between diffusion priors and the objectives of dataset distillation but also provides a practical, training-free framework for improving the quality of the distilled dataset.
☆ A Conditional Diffusion Model for Probabilistic Prediction of Battery Capacity Degradation
Accurate prediction of lithium-ion battery capacity and its associated uncertainty is essential for reliable battery management but remains challenging due to the stochastic nature of aging. This paper presents a novel method, termed the Condition Diffusion U-Net with Attention (CDUA), which integrates feature engineering and deep learning to address this challenge. The proposed approach employs a diffusion-based generative model for time-series forecasting and incorporates attention mechanisms to enhance predictive performance. Battery capacity is first derived from real-world vehicle operation data. The most relevant features are then identified using the Pearson correlation coefficient and the XGBoost algorithm. These features are used to train the CDUA model, which comprises two core components: (1) a contextual U-Net with self-attention to capture complex temporal dependencies, and (2) a denoising network to reconstruct accurate capacity values from noisy observations. Experimental validation on the real-world vehicle data demonstrates that the proposed CDUA model achieves a relative Mean Absolute Error (MAE) of 0.94% and a relative Root Mean Square Error (RMSE) of 1.14%, with a narrow 95% confidence interval of 3.74% in relative width. These results confirm that CDUA provides both accurate capacity estimation and reliable uncertainty quantification. Comparative experiments further verify its robustness and superior performance over existing mainstream approaches.
☆ S4ECG: Exploring the impact of long-range interactions for arrhythmia prediction
The electrocardiogram (ECG) exemplifies biosignal-based time series with continuous, temporally ordered structure reflecting cardiac physiological and pathophysiological dynamics. Detailed analysis of these dynamics has proven challenging, as conventional methods capture either global trends or local waveform features but rarely their simultaneous interplay at high temporal resolution. To bridge global and local signal analysis, we introduce S4ECG, a novel deep learning architecture leveraging structured state space models for multi-epoch arrhythmia classification. Our joint multi-epoch predictions significantly outperform single-epoch approaches by 1.0-11.6% in macro-AUROC, with atrial fibrillation specificity improving from 0.718-0.979 to 0.967-0.998, demonstrating superior performance in-distribution and enhanced out-of-distribution robustness. Systematic investigation reveals optimal temporal dependency windows spanning 10-20 minutes for peak performance. This work contributes to a paradigm shift toward temporally-aware arrhythmia detection algorithms, opening new possibilities for ECG interpretation, in particular for complex arrhythmias like atrial fibrillation and atrial flutter.
☆ Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine
Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
☆ RINS-T: Robust Implicit Neural Solvers for Time Series Linear Inverse Problems
Time series data are often affected by various forms of corruption, such as missing values, noise, and outliers, which pose significant challenges for tasks such as forecasting and anomaly detection. To address these issues, inverse problems focus on reconstructing the original signal from corrupted data by leveraging prior knowledge about its underlying structure. While deep learning methods have demonstrated potential in this domain, they often require extensive pretraining and struggle to generalize under distribution shifts. In this work, we propose RINS-T (Robust Implicit Neural Solvers for Time Series Linear Inverse Problems), a novel deep prior framework that achieves high recovery performance without requiring pretraining data. RINS-T leverages neural networks as implicit priors and integrates robust optimization techniques, making it resilient to outliers while relaxing the reliance on Gaussian noise assumptions. To further improve optimization stability and robustness, we introduce three key innovations: guided input initialization, input perturbation, and convex output combination techniques. Each of these contributions strengthens the framework's optimization stability and robustness. These advancements make RINS-T a flexible and effective solution for addressing complex real-world time series challenges. Our code is available at https://github.com/EPFL-IMOS/RINS-T.
comment: Accepted to IEEE Transactions on Instrumentation and Measurement
☆ MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning
The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.
comment: Accepted and presented at the 2025 International Joint Conference on Neural Networks (IJCNN'25). The paper was awarded an honorable mention (best 4 papers)
☆ Finite-Time Bounds for Average-Reward Fitted Q-Iteration
Although there is an extensive body of work characterizing the sample complexity of discounted-return offline RL with function approximations, prior work on the average-reward setting has received significantly less attention, and existing approaches rely on restrictive assumptions, such as ergodicity or linearity of the MDP. In this work, we establish the first sample complexity results for average-reward offline RL with function approximation for weakly communicating MDPs, a much milder assumption. To this end, we introduce Anchored Fitted Q-Iteration, which combines the standard Fitted Q-Iteration with an anchor mechanism. We show that the anchor, which can be interpreted as a form of weight decay, is crucial for enabling finite-time analysis in the average-reward setting. We also extend our finite-time analysis to the setup where the dataset is generated from a single-trajectory rather than IID transitions, again leveraging the anchor mechanism.
☆ Exploration via Feature Perturbation in Contextual Bandits NeurIPS 2025
We propose feature perturbation, a simple yet powerful technique that injects randomness directly into feature inputs, instead of randomizing unknown parameters or adding noise to rewards. Remarkably, this algorithm achieves $\tilde{\mathcal{O}}(d\sqrt{T})$ worst-case regret bound for generalized linear bandits, while avoiding the $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ regret typical of existing randomized bandit algorithms. Because our algorithm eschews parameter sampling, it is both computationally efficient and naturally extends to non-parametric or neural network models. We verify these advantages through empirical evaluations, demonstrating that feature perturbation not only surpasses existing methods but also unifies strong practical performance with best-known theoretical guarantees.
comment: Accepted at NeurIPS 2025 (spotlight)
☆ TabR1: Taming GRPO for tabular reasoning LLMs
Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with trans- parent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By construct- ing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
☆ Latent Spaces Beyond Synthesis: From GANs to Diffusion Models
This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi's account of synthesis as the amalgamation of distributed representations, we propose a distinction between "synthesis in a strict sense", where a compact latent space wholly determines the generative process, and "synthesis in a broad sense," which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.
comment: Presented and published at Ethics and Aesthetics of Artificial Intelligence Conference (EA-AI'25)
☆ Graph Attention-Guided Search for Dense Multi-Agent Pathfinding
Finding near-optimal solutions for dense multi-agent pathfinding (MAPF) problems in real-time remains challenging even for state-of-the-art planners. To this end, we develop a hybrid framework that integrates a learned heuristic derived from MAGAT, a neural MAPF policy with a graph attention scheme, into a leading search-based algorithm, LaCAM. While prior work has explored learning-guided search in MAPF, such methods have historically underperformed. In contrast, our approach, termed LaGAT, outperforms both purely search-based and purely learning-based methods in dense scenarios. This is achieved through an enhanced MAGAT architecture, a pre-train-then-fine-tune strategy on maps of interest, and a deadlock detection scheme to account for imperfect neural guidance. Our results demonstrate that, when carefully designed, hybrid search offers a powerful solution for tightly coupled, challenging multi-agent coordination problems.
☆ Beyond Binary Out-of-Distribution Detection: Characterizing Distributional Shifts with Multi-Statistic Diffusion Trajectories
Detecting out-of-distribution (OOD) data is critical for machine learning, be it for safety reasons or to enable open-ended learning. However, beyond mere detection, choosing an appropriate course of action typically hinges on the type of OOD data encountered. Unfortunately, the latter is generally not distinguished in practice, as modern OOD detection methods collapse distributional shifts into single scalar outlier scores. This work argues that scalar-based methods are thus insufficient for OOD data to be properly contextualized and prospectively exploited, a limitation we overcome with the introduction of DISC: Diffusion-based Statistical Characterization. DISC leverages the iterative denoising process of diffusion models to extract a rich, multi-dimensional feature vector that captures statistical discrepancies across multiple noise levels. Extensive experiments on image and tabular benchmarks show that DISC matches or surpasses state-of-the-art detectors for OOD detection and, crucially, also classifies OOD type, a capability largely absent from prior work. As such, our work enables a shift from simple binary OOD detection to a more granular detection.
comment: 11 Pages, 6 Figures
☆ Optimizing Energy Management of Smart Grid using Reinforcement Learning aided by Surrogate models built using Physics-informed Neural Networks
Optimizing the energy management within a smart grids scenario presents significant challenges, primarily due to the complexity of real-world systems and the intricate interactions among various components. Reinforcement Learning (RL) is gaining prominence as a solution for addressing the challenges of Optimal Power Flow in smart grids. However, RL needs to iterate compulsively throughout a given environment to obtain the optimal policy. This means obtaining samples from a, most likely, costly simulator, which can lead to a sample efficiency problem. In this work, we address this problem by substituting costly smart grid simulators with surrogate models built using Phisics-informed Neural Networks (PINNs), optimizing the RL policy training process by arriving to convergent results in a fraction of the time employed by the original environment.
☆ Model Metamers Reveal Invariances in Graph Neural Networks
In recent years, deep neural networks have been extensively employed in perceptual systems to learn representations endowed with invariances, aiming to emulate the invariance mechanisms observed in the human brain. However, studies in the visual and auditory domains have confirmed that significant gaps remain between the invariance properties of artificial neural networks and those of humans. To investigate the invariance behavior within graph neural networks (GNNs), we introduce a model ``metamers'' generation technique. By optimizing input graphs such that their internal node activations match those of a reference graph, we obtain graphs that are equivalent in the model's representation space, yet differ significantly in both structure and node features. Our theoretical analysis focuses on two aspects: the local metamer dimension for a single node and the activation-induced volume change of the metamer manifold. Utilizing this approach, we uncover extreme levels of representational invariance across several classic GNN architectures. Although targeted modifications to model architecture and training strategies can partially mitigate this excessive invariance, they fail to fundamentally bridge the gap to human-like invariance. Finally, we quantify the deviation between metamer graphs and their original counterparts, revealing unique failure modes of current GNNs and providing a complementary benchmark for model evaluation.
☆ Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots NeurIPS 2025
Robotic systems are increasingly expected to operate in human-centered, unstructured environments where safety, adaptability, and generalization are essential. Vision-Language-Action (VLA) models have been proposed as a language guided generalized control framework for real robots. However, their deployment has been limited to conventional serial link manipulators. Coupled by their rigidity and unpredictability of learning based control, the ability to safely interact with the environment is missing yet critical. In this work, we present the deployment of a VLA model on a soft continuum manipulator to demonstrate autonomous safe human-robot interaction. We present a structured finetuning and deployment pipeline evaluating two state-of-the-art VLA models (OpenVLA-OFT and $\pi_0$) across representative manipulation tasks, and show while out-of-the-box policies fail due to embodiment mismatch, through targeted finetuning the soft robot performs equally to the rigid counterpart. Our findings highlight the necessity of finetuning for bridging embodiment gaps, and demonstrate that coupling VLA models with soft robots enables safe and flexible embodied AI in human-shared environments.
comment: Accepted by NeurIPS 2025 SpaVLE workshop. 4 pages, 2 figures(in main paper, excluding references and supplements)
☆ Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs NeurIPS 2025
Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
comment: NeurIPS 2025
☆ M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
comment: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures
Localist LLMs with Recruitment Learning
We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovations are (1) a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining, (2) an information-theoretic recruitment mechanism that adaptively allocates semantic blocks as needed, eliminating the requirement for complete domain knowledge at initialization, and (3) a hierarchical recruitment framework that extends capacity allocation to entire specialized LLMs, enabling multi-granularity architectural adaptation. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, dynamic rule injection, and principled recruitment criteria based on penalized likelihood with explicit units. We provide rigorous mathematical results establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks at stationary points, with exact bounds on attention entropy and pointer fidelity. The hierarchical recruitment mechanism provides convergence guarantees at both the block level (fine-grained, within-LLM) and the LLM level (coarse-grained, cross-domain), ensuring the system discovers semantic partitions that balance model complexity against data encoding efficiency. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes while adapting architectural capacity at multiple granularities, supporting applications in regulated domains requiring both transparency and capability.
☆ Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
comment: This work is in progress
☆ Optimal Best Arm Identification under Differential Privacy
Best Arm Identification (BAI) algorithms are deployed in data-sensitive applications, such as adaptive clinical trials or user studies. Driven by the privacy concerns of these applications, we study the problem of fixed-confidence BAI under global Differential Privacy (DP) for Bernoulli distributions. While numerous asymptotically optimal BAI algorithms exist in the non-private setting, a significant gap remains between the best lower and upper bounds in the global DP setting. This work reduces this gap to a small multiplicative constant, for any privacy budget $\epsilon$. First, we provide a tighter lower bound on the expected sample complexity of any $\delta$-correct and $\epsilon$-global DP strategy. Our lower bound replaces the Kullback-Leibler (KL) divergence in the transportation cost used by the non-private characteristic time with a new information-theoretic quantity that optimally trades off between the KL divergence and the Total Variation distance scaled by $\epsilon$. Second, we introduce a stopping rule based on these transportation costs and a private estimator of the means computed using an arm-dependent geometric batching. En route to proving the correctness of our stopping rule, we derive concentration results of independent interest for the Laplace distribution and for the sum of Bernoulli and Laplace distributions. Third, we propose a Top Two sampling rule based on these transportation costs. For any budget $\epsilon$, we show an asymptotic upper bound on its expected sample complexity that matches our lower bound to a multiplicative constant smaller than $8$. Our algorithm outperforms existing $\delta$-correct and $\epsilon$-global DP BAI algorithms for different values of $\epsilon$.
comment: 92 pages, 2 figures, 2 tables. To be published in the Thirty-Ninth Annual Conference on Neural Information Processing Systems
☆ Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling
Reward models are essential for aligning Large Language Models (LLMs) with human values, yet their development is hampered by costly preference datasets and poor interpretability. While recent rubric-based approaches offer transparency, they often lack systematic quality control and optimization, creating a trade-off between scalability and reliability. We address these limitations with a novel, training-free framework built on a key assumption: \textit{evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries}, a property that enables remarkable data efficiency. Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided \textbf{Propose-Evaluate-Revise} pipeline. Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an \textbf{information-theoretic coding rate}. The final output is an interpretable, hierarchical "Theme-Tips" rubric set. Extensive experiments demonstrate the framework's exceptional data efficiency and performance. Critically, using just 70 preference pairs (1.5\% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts. This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.
☆ Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Learning disentangled representations in sequential data is a key goal in deep learning, with broad applications in vision, audio, and time series. While real-world data involves multiple interacting semantic factors over time, prior work has mostly focused on simpler two-factor static and dynamic settings, primarily because such settings make data collection easier, thereby overlooking the inherently multi-factor nature of real-world data. We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets spanning video, audio, and time series. Our benchmark includes modular tools for dataset integration, model development, and evaluation metrics tailored to multi-factor analysis. We additionally propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Moreover, we show that Vision-Language Models can automate dataset annotation and serve as zero-shot disentanglement evaluators, removing the need for manual labels and human intervention. Together, these contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement.
☆ Symmetries in PAC-Bayesian Learning
Symmetries are known to improve the empirical performance of machine learning models, yet theoretical guarantees explaining these gains remain limited. Prior work has focused mainly on compact group symmetries and often assumes that the data distribution itself is invariant, an assumption rarely satisfied in real-world applications. In this work, we extend generalization guarantees to the broader setting of non-compact symmetries, such as translations and to non-invariant data distributions. Building on the PAC-Bayes framework, we adapt and tighten existing bounds, demonstrating the approach on McAllester's PAC-Bayes bound while showing that it applies to a wide range of PAC-Bayes bounds. We validate our theory with experiments on a rotated MNIST dataset with a non-uniform rotation group, where the derived guarantees not only hold but also improve upon prior results. These findings provide theoretical evidence that, for symmetric data, symmetric models are preferable beyond the narrow setting of compact groups and invariant distributions, opening the way to a more general understanding of symmetries in machine learning.
☆ MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
Scaling up data, parameters, and test-time computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.
☆ Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
Control-flow hijacking attacks manipulate orchestration mechanisms in multi-agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter-agent communications to ensure that all agent invocations are "related to" and "likely to further" the original objective. We start by demonstrating control-flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. We argue that the safety and functionality objectives of multi-agent systems fundamentally conflict with each other. This conflict is exacerbated by the brittle definitions of "alignment" and the checkers' incomplete visibility into the execution context. We then propose, implement, and evaluate ControlValve, a new defense inspired by the principles of control-flow integrity and least privilege. ControlValve (1) generates permitted control-flow graphs for multi-agent systems, and (2) enforces that all executions comply with these graphs, along with contextual rules (generated in a zero-shot manner) for each agent invocation.
☆ Uncertainty-aware data assimilation through variational inference
Data assimilation, consisting in the combination of a dynamical model with a set of noisy and incomplete observations in order to infer the state of a system over time, involves uncertainty in most settings. Building upon an existing deterministic machine learning approach, we propose a variational inference-based extension in which the predicted state follows a multivariate Gaussian distribution. Using the chaotic Lorenz-96 dynamics as a testing ground, we show that our new model enables to obtain nearly perfectly calibrated predictions, and can be integrated in a wider variational data assimilation pipeline in order to achieve greater benefit from increasing lengths of data assimilation windows. Our code is available at https://github.com/anthony-frion/Stochastic_CODA.
☆ Adaptive Discretization for Consistency Models NeurIPS 2025
Consistency Models (CMs) have shown promise for efficient one-step generation. However, most existing CMs rely on manually designed discretization schemes, which can cause repeated adjustments for different noise schedules and datasets. To address this, we propose a unified framework for the automatic and adaptive discretization of CMs, formulating it as an optimization problem with respect to the discretization step. Concretely, during the consistency training process, we propose using local consistency as the optimization objective to ensure trainability by avoiding excessive discretization, and taking global consistency as a constraint to ensure stability by controlling the denoising error in the training target. We establish the trade-off between local and global consistency with a Lagrange multiplier. Building on this framework, we achieve adaptive discretization for CMs using the Gauss-Newton method. We refer to our approach as ADCMs. Experiments demonstrate that ADCMs significantly improve the training efficiency of CMs, achieving superior generative performance with minimal training overhead on both CIFAR-10 and ImageNet. Moreover, ADCMs exhibit strong adaptability to more advanced DM variants. Code is available at https://github.com/rainstonee/ADCM.
comment: Accepted by NeurIPS 2025
☆ Fair and Interpretable Deepfake Detection in Videos
Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.
comment: 10 pages (including References)
☆ High-Level Multi-Robot Trajectory Planning And Spurious Behavior Detection
The reliable execution of high-level missions in multi-robot systems with heterogeneous agents, requires robust methods for detecting spurious behaviors. In this paper, we address the challenge of identifying spurious executions of plans specified as a Linear Temporal Logic (LTL) formula, as incorrect task sequences, violations of spatial constraints, timing inconsis- tencies, or deviations from intended mission semantics. To tackle this, we introduce a structured data generation framework based on the Nets-within-Nets (NWN) paradigm, which coordinates robot actions with LTL-derived global mission specifications. We further propose a Transformer-based anomaly detection pipeline that classifies robot trajectories as normal or anomalous. Experi- mental evaluations show that our method achieves high accuracy (91.3%) in identifying execution inefficiencies, and demonstrates robust detection capabilities for core mission violations (88.3%) and constraint-based adaptive anomalies (66.8%). An ablation experiment of the embedding and architecture was carried out, obtaining successful results where our novel proposition performs better than simpler representations.
comment: 6 pages,3 figures, Iberian Robotics Conference 2025
☆ A Prototypical Network with an Attention-based Encoder for Drivers Identification Application
Driver identification has become an area of increasing interest in recent years, especially for data- driven applications, because biometric-based technologies may incur privacy issues. This study proposes a deep learning neural network architecture, an attention-based encoder (AttEnc), which uses an attention mechanism for driver identification and uses fewer model parameters than current methods. Most studies do not address the issue of data shortages for driver identification, and most of them are inflexible when encountering unknown drivers. In this study, an architecture that combines a prototypical network and an attention-based encoder (P-AttEnc) is proposed. It applies few-shot learning to overcome the data shortage issues and to enhance model generalizations. The experiments showed that the attention-based encoder can identify drivers with accuracies of 99.3%, 99.0% and 99.9% in three different datasets and has a prediction time that is 44% to 79% faster because it significantly reduces, on average, 87.6% of the model parameters. P-AttEnc identifies drivers based on few shot data, extracts driver fingerprints to address the issue of data shortages, and is able to classify unknown drivers. The first experiment showed that P-AttEnc can identify drivers with an accuracy of 69.8% in the one-shot scenario. The second experiment showed that P-AttEnc, in the 1-shot scenario, can classify unknown drivers with an average accuracy of 65.7%.
☆ Diagnosis of Fuel Cell Health Status with Deep Sparse Auto-Encoder Neural Network
Effective and accurate diagnosis of fuel cell health status is crucial for ensuring the stable operation of fuel cell stacks. Among various parameters, high-frequency impedance serves as a critical indicator for assessing fuel cell state and health conditions. However, its online testing is prohibitively complex and costly. This paper employs a deep sparse auto-encoding network for the prediction and classification of high-frequency impedance in fuel cells, achieving metric of accuracy rate above 92\%. The network is further deployed on an FPGA, attaining a hardware-based recognition rate almost 90\%.
☆ D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks
Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments on locomotion and manipulation benchmarks with high risks of failure demonstrate that our method outperforms baselines, underscoring the importance of explicitly modeling multimodality and risk in RL.
☆ Temporally Detailed Hypergraph Neural ODEs for Type 2 Diabetes Progression Modeling
Disease progression modeling aims to characterize and predict how a patient's disease complications worsen over time based on longitudinal electronic health records (EHRs). Accurate modeling of disease progression, such as type 2 diabetes, can enhance patient sub-phenotyping and inform effective and timely interventions. However, the problem is challenging due to the need to learn continuous-time dynamics of progression patterns based on irregular-time event samples and patient heterogeneity (\eg different progression rates and pathways). Existing mechanistic and data-driven methods either lack adaptability to learn from real-world data or fail to capture complex continuous-time dynamics on progression trajectories. To address these limitations, we propose Temporally Detailed Hypergraph Neural Ordinary Differential Equation (TD-HNODE), which represents disease progression on clinically recognized trajectories as a temporally detailed hypergraph and learns the continuous-time progression dynamics via a neural ODE framework. TD-HNODE contains a learnable TD-Hypergraph Laplacian that captures the interdependency of disease complication markers within both intra- and inter-progression trajectories. Experiments on two real-world clinical datasets demonstrate that TD-HNODE outperforms multiple baselines in modeling the progression of type 2 diabetes and related cardiovascular diseases.
☆ Soft-Masked Diffusion Language Models
Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
☆ Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
comment: Preprint. Work in progress
☆ SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
Transformers have shown remarkable performance in both natural language processing (NLP) and computer vision (CV) tasks. However, their real-time inference speed and efficiency are limited due to the inefficiency in Softmax and Layer Normalization (LayerNorm). Previous works based on function approximation suffer from inefficient implementation as they place emphasis on computation while disregarding memory overhead concerns. Moreover, such methods rely on retraining to compensate for approximation error which can be costly and inconvenient. In this paper, we present SOLE, a hardware-software co-design for Softmax and LayerNorm which is composed of E2Softmax and AILayerNorm. E2Softmax utilizes log2 quantization of exponent function and log-based division to approximate Softmax while AILayerNorm adopts low-precision statistic calculation. Compared with state-of-the-art designs, we achieve both low-precision calculation and low bit-width storage on Softmax and LayerNorm. Experiments show that SOLE maintains inference accuracy without retraining while offering orders of magnitude speedup and energy savings over GPU, achieving 3.04x, 3.86x energy-efficiency improvements and 2.82x, 3.32x area-efficiency improvements over prior state-of-the-art custom hardware for Softmax and LayerNorm, respectively.
☆ A Standardized Benchmark for Machine-Learned Molecular Dynamics using Weighted Ensemble Sampling
The rapid evolution of molecular dynamics (MD) methods, including machine-learned dynamics, has outpaced the development of standardized tools for method validation. Objective comparison between simulation approaches is often hindered by inconsistent evaluation metrics, insufficient sampling of rare conformational states, and the absence of reproducible benchmarks. To address these challenges, we introduce a modular benchmarking framework that systematically evaluates protein MD methods using enhanced sampling analysis. Our approach uses weighted ensemble (WE) sampling via The Weighted Ensemble Simulation Toolkit with Parallelization and Analysis (WESTPA), based on progress coordinates derived from Time-lagged Independent Component Analysis (TICA), enabling fast and efficient exploration of protein conformational space. The framework includes a flexible, lightweight propagator interface that supports arbitrary simulation engines, allowing both classical force fields and machine learning-based models. Additionally, the framework offers a comprehensive evaluation suite capable of computing more than 19 different metrics and visualizations across a variety of domains. We further contribute a dataset of nine diverse proteins, ranging from 10 to 224 residues, that span a variety of folding complexities and topologies. Each protein has been extensively simulated at 300K for one million MD steps per starting point (4 ns). To demonstrate the utility of our framework, we perform validation tests using classic MD simulations with implicit solvent and compare protein conformational sampling using a fully trained versus under-trained CGSchNet model. By standardizing evaluation protocols and enabling direct, reproducible comparisons across MD approaches, our open-source platform lays the groundwork for consistent, rigorous benchmarking across the molecular simulation community.
comment: 37 Pages (Main Text), 10 Figures, Submitted to Journal of Physical Chemistry B
☆ Robustness in Text-Attributed Graph Learning: Insights, Trade-offs, and New Defenses
While Graph Neural Networks (GNNs) and Large Language Models (LLMs) are powerful approaches for learning on Text-Attributed Graphs (TAGs), a comprehensive understanding of their robustness remains elusive. Current evaluations are fragmented, failing to systematically investigate the distinct effects of textual and structural perturbations across diverse models and attack scenarios. To address these limitations, we introduce a unified and comprehensive framework to evaluate robustness in TAG learning. Our framework evaluates classical GNNs, robust GNNs (RGNNs), and GraphLLMs across ten datasets from four domains, under diverse text-based, structure-based, and hybrid perturbations in both poisoning and evasion scenarios. Our extensive analysis reveals multiple findings, among which three are particularly noteworthy: 1) models have inherent robustness trade-offs between text and structure, 2) the performance of GNNs and RGNNs depends heavily on the text encoder and attack type, and 3) GraphLLMs are particularly vulnerable to training data corruption. To overcome the identified trade-offs, we introduce SFT-auto, a novel framework that delivers superior and balanced robustness against both textual and structural attacks within a single model. Our work establishes a foundation for future research on TAG security and offers practical solutions for robust TAG learning in adversarial environments. Our code is available at: https://github.com/Leirunlin/TGRB.
☆ QRïS: A Preemptive Novel Method for Quishing Detection Through Structural Features of QR
Globally, individuals and organizations employ Quick Response (QR) codes for swift and convenient communication. Leveraging this, cybercriminals embed falsify and misleading information in QR codes to launch various phishing attacks which termed as Quishing. Many former studies have introduced defensive approaches to preclude Quishing such as by classifying the embedded content of QR codes and then label the QR codes accordingly, whereas other studies classify them using visual features (i.e., deep features, histogram density analysis features). However, these approaches mainly rely on black-box techniques which do not clearly provide interpretability and transparency to fully comprehend and reproduce the intrinsic decision process; therefore, having certain obvious limitations includes the approaches' trust, accountability, issues in bias detection, and many more. We proposed QR\"iS, the pioneer method to classify QR codes through the comprehensive structural analysis of a QR code which helps to identify phishing QR codes beforehand. Our classification method is clearly transparent which makes it reproducible, scalable, and easy to comprehend. First, we generated QR codes dataset (i.e. 400,000 samples) using recently published URLs datasets [1], [2]. Then, unlike black-box models, we developed a simple algorithm to extract 24 structural features from layout patterns present in QR codes. Later, we train the machine learning models on the harvested features and obtained accuracy of up to 83.18%. To further evaluate the effectiveness of our approach, we perform the comparative analysis of proposed method with relevant contemporary studies. Lastly, for real-world deployment and validation, we developed a mobile app which assures the feasibility of the proposed solution in real-world scenarios which eventually strengthen the applicability of the study.
comment: 13 pages, 11 figures, and 7 tables
☆ ALPINE: A Lightweight and Adaptive Privacy-Decision Agent Framework for Dynamic Edge Crowdsensing WWW 2026
Mobile edge crowdsensing (MECS) systems continuously generate and transmit user data in dynamic, resource-constrained environments, exposing users to significant privacy threats. In practice, many privacy-preserving mechanisms build on differential privacy (DP). However, static DP mechanisms often fail to adapt to evolving risks, for example, shifts in adversarial capabilities, resource constraints and task requirements, resulting in either excessive noise or inadequate protection. To address this challenge, we propose ALPINE, a lightweight, adaptive framework that empowers terminal devices to autonomously adjust differential privacy levels in real time. ALPINE operates as a closed-loop control system consisting of four modules: dynamic risk perception, privacy decision via twin delayed deep deterministic policy gradient (TD3), local privacy execution and performance verification from edge nodes. Based on environmental risk assessments, we design a reward function that balances privacy gains, data utility and energy cost, guiding the TD3 agent to adaptively tune noise magnitude across diverse risk scenarios and achieve a dynamic equilibrium among privacy, utility and cost. Both the collaborative risk model and pretrained TD3-based agent are designed for low-overhead deployment. Extensive theoretical analysis and real-world simulations demonstrate that ALPINE effectively mitigates inference attacks while preserving utility and cost, making it practical for large-scale edge applications.
comment: 12 pages, 8 figures, 4 tables. Submitted to The Web Conference (WWW 2026)
☆ Learning After Model Deployment
In classic supervised learning, once a model is deployed in an application, it is fixed. No updates will be made to it during the application. This is inappropriate for many dynamic and open environments, where unexpected samples from unseen classes may appear. In such an environment, the model should be able to detect these novel samples from unseen classes and learn them after they are labeled. We call this paradigm Autonomous Learning after Model Deployment (ALMD). The learning here is continuous and involves no human engineers. Labeling in this scenario is performed by human co-workers or other knowledgeable agents, which is similar to what humans do when they encounter an unfamiliar object and ask another person for its name. In ALMD, the detection of novel samples is dynamic and differs from traditional out-of-distribution (OOD) detection in that the set of in-distribution (ID) classes expands as new classes are learned during application, whereas ID classes is fixed in traditional OOD detection. Learning is also different from classic supervised learning because in ALMD, we learn the encountered new classes immediately and incrementally. It is difficult to retrain the model from scratch using all the past data from the ID classes and the novel samples from newly discovered classes, as this would be resource- and time-consuming. Apart from these two challenges, ALMD faces the data scarcity issue because instances of new classes often appear sporadically in real-life applications. To address these issues, we propose a novel method, PLDA, which performs dynamic OOD detection and incremental learning of new classes on the fly. Empirical evaluations will demonstrate the effectiveness of PLDA.
comment: Published at ECAI-2025
☆ HyperSearch: Prediction of New Hyperedges through Unconstrained yet Efficient Search ICDM
Higher-order interactions (HOIs) in complex systems, such as scientific collaborations, multi-protein complexes, and multi-user communications, are commonly modeled as hypergraphs, where each hyperedge (i.e., a subset of nodes) represents an HOI among the nodes. Given a hypergraph, hyperedge prediction aims to identify hyperedges that are either missing or likely to form in the future, and it has broad applications, including recommending interest-based social groups, predicting collaborations, and uncovering functional complexes in biological systems. However, the vast search space of hyperedge candidates (i.e., all possible subsets of nodes) poses a significant computational challenge, making naive exhaustive search infeasible. As a result, existing approaches rely on either heuristic sampling to obtain constrained candidate sets or ungrounded assumptions on hypergraph structure to select promising hyperedges. In this work, we propose HyperSearch, a search-based algorithm for hyperedge prediction that efficiently evaluates unconstrained candidate sets, by incorporating two key components: (1) an empirically grounded scoring function derived from observations in real-world hypergraphs and (2) an efficient search mechanism, where we derive and use an anti-monotonic upper bound of the original scoring function (which is not antimonotonic) to prune the search space. This pruning comes with theoretical guarantees, ensuring that discarded candidates are never better than the kept ones w.r.t. the original scoring function. In extensive experiments on 10 real-world hypergraphs across five domains, HyperSearch consistently outperforms state-of-the-art baselines, achieving higher accuracy in predicting new (i.e., not in the training set) hyperedges.
comment: IEEE International Conference on Data Mining (ICDM) 2025
☆ In-situ Autoguidance: Eliciting Self-Correction in Diffusion Models ICML 2025
The generation of high-quality, diverse, and prompt-aligned images is a central goal in image-generating diffusion models. The popular classifier-free guidance (CFG) approach improves quality and alignment at the cost of reduced variation, creating an inherent entanglement of these effects. Recent work has successfully disentangled these properties by guiding a model with a separately trained, inferior counterpart; however, this solution introduces the considerable overhead of requiring an auxiliary model. We challenge this prerequisite by introducing In-situ Autoguidance, a method that elicits guidance from the model itself without any auxiliary components. Our approach dynamically generates an inferior prediction on the fly using a stochastic forward pass, reframing guidance as a form of inference-time self-correction. We demonstrate that this zero-cost approach is not only viable but also establishes a powerful new baseline for cost-efficient guidance, proving that the benefits of self-guidance can be achieved without external models.
comment: 6 pages, 3 figures. ICML 2025 Workshop submission
☆ Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction
Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel. In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred. This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation? We address this problem by introducing a unified benchmark for evaluating latent information discovery - the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. All tasks share a tri-agent framework (User, Assistant, Judge) enabling turn-level evaluation of elicitation and adaptation. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context: from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.
☆ Continuous Q-Score Matching: Diffusion Guided Reinforcement Learning for Continuous-Time Control
Reinforcement learning (RL) has achieved significant success across a wide range of domains, however, most existing methods are formulated in discrete time. In this work, we introduce a novel RL method for continuous-time control, where stochastic differential equations govern state-action dynamics. Departing from traditional value function-based approaches, our key contribution is the characterization of continuous-time Q-functions via a martingale condition and the linking of diffusion policy scores to the action gradient of a learned continuous Q-function by the dynamic programming principle. This insight motivates Continuous Q-Score Matching (CQSM), a score-based policy improvement algorithm. Notably, our method addresses a long-standing challenge in continuous-time RL: preserving the action-evaluation capability of Q-functions without relying on time discretization. We further provide theoretical closed-form solutions for linear-quadratic (LQ) control problems within our framework. Numerical results in simulated environments demonstrate the effectiveness of our proposed method and compare it to popular baselines.
☆ Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation
We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability an d random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random metric with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricidal free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
☆ Verification-Aware Planning for Multi-Agent Systems
Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
comment: Submission for ARR Oct
☆ Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling
Transformers have achieved remarkable success in time series modeling, yet their internal mechanisms remain opaque. This work demystifies the Transformer encoder by establishing its fundamental equivalence to a Graph Convolutional Network (GCN). We show that in the forward pass, the attention distribution matrix serves as a dynamic adjacency matrix, and its composition with subsequent transformations performs computations analogous to graph convolution. Moreover, we demonstrate that in the backward pass, the update dynamics of value and feed-forward projections mirror those of GCN parameters. Building on this unified theoretical reinterpretation, we propose \textbf{Fighter} (Flexible Graph Convolutional Transformer), a streamlined architecture that removes redundant linear projections and incorporates multi-hop graph aggregation. This perspective yields an explicit and interpretable representation of temporal dependencies across different scales, naturally expressed as graph edges. Experiments on standard forecasting benchmarks confirm that Fighter achieves competitive performance while providing clearer mechanistic interpretability of its predictions.
comment: Preprint
☆ Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and ${O}(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.
comment: 49 pages
☆ On the Universal Near Optimality of Hedge in Combinatorial Settings
In this paper, we study the classical Hedge algorithm in combinatorial settings. In each round, the learner selects a vector $\boldsymbol{x}_t$ from a set $X \subseteq \{0,1\}^d$, observes a full loss vector $\boldsymbol{y}_t \in \mathbb{R}^d$, and incurs a loss $\langle \boldsymbol{x}_t, \boldsymbol{y}_t \rangle \in [-1,1]$. This setting captures several important problems, including extensive-form games, resource allocation, $m$-sets, online multitask learning, and shortest-path problems on directed acyclic graphs (DAGs). It is well known that Hedge achieves a regret of $O\big(\sqrt{T \log |X|}\big)$ after $T$ rounds of interaction. In this paper, we ask whether Hedge is optimal across all combinatorial settings. To that end, we show that for any $X \subseteq \{0,1\}^d$, Hedge is near-optimal--specifically, up to a $\sqrt{\log d}$ factor--by establishing a lower bound of $\Omega\big(\sqrt{T \log(|X|)/\log d}\big)$ that holds for any algorithm. We then identify a natural class of combinatorial sets--namely, $m$-sets with $\log d \leq m \leq \sqrt{d}$--for which this lower bound is tight, and for which Hedge is provably suboptimal by a factor of exactly $\sqrt{\log d}$. At the same time, we show that Hedge is optimal for online multitask learning, a generalization of the classical $K$-experts problem. Finally, we leverage the near-optimality of Hedge to establish the existence of a near-optimal regularizer for online shortest-path problems in DAGs--a setting that subsumes a broad range of combinatorial domains. Specifically, we show that the classical Online Mirror Descent (OMD) algorithm, when instantiated with the dilated entropy regularizer, is iterate-equivalent to Hedge, and therefore inherits its near-optimal regret guarantees for DAGs.
comment: 28 pages, 1 Figure
☆ Explainable Heterogeneous Anomaly Detection in Financial Networks via Adaptive Expert Routing
Financial anomalies exhibit heterogeneous mechanisms (price shocks, liquidity freezes, contagion cascades, regime shifts), but existing detectors treat all anomalies uniformly, producing scalar scores without revealing which mechanism is failing, where risks concentrate, or how to intervene. This opacity prevents targeted regulatory responses. Three unsolved challenges persist: (1) static graph structures cannot adapt when market correlations shift during regime changes; (2) uniform detection mechanisms miss type-specific signatures across multiple temporal scales while failing to integrate individual behaviors with network contagion; (3) black-box outputs provide no actionable guidance on anomaly mechanisms or their temporal evolution. We address these via adaptive graph learning with specialized expert networks that provide built-in interpretability. Our framework captures multi-scale temporal dependencies through BiLSTM with self-attention, fuses temporal and spatial information via cross-modal attention, learns dynamic graphs through neural multi-source interpolation, adaptively balances learned dynamics with structural priors via stress-modulated fusion, routes anomalies to four mechanism-specific experts, and produces dual-level interpretable attributions. Critically, interpretability is embedded architecturally rather than applied post-hoc. On 100 US equities (2017-2024), we achieve 92.3% detection of 13 major events with 3.8-day lead time, outperforming best baseline by 30.8pp. Silicon Valley Bank case study demonstrates anomaly evolution tracking: Price-Shock expert weight rose to 0.39 (33% above baseline 0.29) during closure, peaking at 0.48 (66% above baseline) one week later, revealing automatic temporal mechanism identification without labeled supervision.
☆ Data Reliability Scoring
How can we assess the reliability of a dataset without access to ground truth? We introduce the problem of reliability scoring for datasets collected from potentially strategic sources. The true data are unobserved, but we see outcomes of an unknown statistical experiment that depends on them. To benchmark reliability, we define ground-truth-based orderings that capture how much reported data deviate from the truth. We then propose the Gram determinant score, which measures the volume spanned by vectors describing the empirical distribution of the observed data and experiment outcomes. We show that this score preserves several ground-truth based reliability orderings and, uniquely up to scaling, yields the same reliability ranking of datasets regardless of the experiment -- a property we term experiment agnosticism. Experiments on synthetic noise models, CIFAR-10 embeddings, and real employment data demonstrate that the Gram determinant score effectively captures data quality across diverse observation processes.
comment: 39 pages, 5 figures
☆ DFNN: A Deep Fréchet Neural Network Framework for Learning Metric-Space-Valued Responses
Regression with non-Euclidean responses -- e.g., probability distributions, networks, symmetric positive-definite matrices, and compositions -- has become increasingly important in modern applications. In this paper, we propose deep Fr\'echet neural networks (DFNNs), an end-to-end deep learning framework for predicting non-Euclidean responses -- which are considered as random objects in a metric space -- from Euclidean predictors. Our method leverages the representation-learning power of deep neural networks (DNNs) to the task of approximating conditional Fr\'echet means of the response given the predictors, the metric-space analogue of conditional expectations, by minimizing a Fr\'echet risk. The framework is highly flexible, accommodating diverse metrics and high-dimensional predictors. We establish a universal approximation theorem for DFNNs, advancing the state-of-the-art of neural network approximation theory to general metric-space-valued responses without making model assumptions or relying on local smoothing. Empirical studies on synthetic distributional and network-valued responses, as well as a real-world application to predicting employment occupational compositions, demonstrate that DFNNs consistently outperform existing methods.
☆ Convergence of Regret Matching in Potential Games and Constrained Optimization
Regret matching (RM} -- and its modern variants -- is a foundational online algorithm that has been at the heart of many AI breakthrough results in solving benchmark zero-sum games, such as poker. Yet, surprisingly little is known so far in theory about its convergence beyond two-player zero-sum games. For example, whether regret matching converges to Nash equilibria in potential games has been an open problem for two decades. Even beyond games, one could try to use RM variants for general constrained optimization problems. Recent empirical evidence suggests that they -- particularly regret matching$^+$ (RM$^+$) -- attain strong performance on benchmark constrained optimization problems, outperforming traditional gradient descent-type algorithms. We show that alternating RM$^+$ converges to an $\epsilon$-KKT point after $O_\epsilon(1/\epsilon^4)$ iterations, establishing for the first time that it is a sound and fast first-order optimizer. Our argument relates the KKT gap to the accumulated regret, two quantities that are entirely disparate in general but interact in an intriguing way in our setting, so much so that when regrets are bounded, our complexity bound improves all the way to $O_\epsilon(1/\epsilon^2)$. From a technical standpoint, while RM$^+$ does not have the usual one-step improvement property in general, we show that it does in a certain region that the algorithm will quickly reach and remain in thereafter. In sharp contrast, our second main result establishes a lower bound: RM, with or without alternation, can take an exponential number of iterations to reach a crude approximate solution even in two-player potential games. This represents the first worst-case separation between RM and RM$^+$. Our lower bound shows that convergence to coarse correlated equilibria in potential games is exponentially faster than convergence to Nash equilibria.
☆ Mode Collapse of Mean-Field Variational Inference
Mean-field variational inference (MFVI) is a widely used method for approximating high-dimensional probability distributions by product measures. It has been empirically observed that MFVI optimizers often suffer from mode collapse. Specifically, when the target measure $\pi$ is a mixture $\pi = w P_0 + (1 - w) P_1$, the MFVI optimizer tends to place most of its mass near a single component of the mixture. This work provides the first theoretical explanation of mode collapse in MFVI. We introduce the notion to capture the separatedness of the two mixture components -- called $\varepsilon$-separateness -- and derive explicit bounds on the fraction of mass that any MFVI optimizer assigns to each component when $P_0$ and $P_1$ are $\varepsilon$-separated for sufficiently small $\varepsilon$. Our results suggest that the occurrence of mode collapse crucially depends on the relative position of the components. To address this issue, we propose the rotational variational inference (RoVI), which augments MFVI with a rotation matrix. The numerical studies support our theoretical findings and demonstrate the benefits of RoVI.
☆ Consistent Zero-Shot Imitation with Contrastive Goal Inference
In the same way that generative models today conduct most of their training in a self-supervised fashion, how can agentic models conduct their training in a self-supervised fashion, interactively exploring, learning, and preparing to quickly adapt to new tasks? A prerequisite for embodied agents deployed in real world interactions ought to be training with interaction, yet today's most successful AI models (e.g., VLMs, LLMs) are trained without an explicit notion of action. The problem of pure exploration (which assumes no data as input) is well studied in the reinforcement learning literature and provides agents with a wide array of experiences, yet it fails to prepare them for rapid adaptation to new tasks. Today's language and vision models are trained on data provided by humans, which provides a strong inductive bias for the sorts of tasks that the model will have to solve (e.g., modeling chords in a song, phrases in a sonnet, sentences in a medical record). However, when they are prompted to solve a new task, there is a faulty tacit assumption that humans spend most of their time in the most rewarding states. The key contribution of our paper is a method for pre-training interactive agents in a self-supervised fashion, so that they can instantly mimic human demonstrations. Our method treats goals (i.e., observations) as the atomic construct. During training, our method automatically proposes goals and practices reaching them, building off prior work in reinforcement learning exploration. During evaluation, our method solves an (amortized) inverse reinforcement learning problem to explain demonstrations as optimal goal-reaching behavior. Experiments on standard benchmarks (not designed for goal-reaching) show that our approach outperforms prior methods for zero-shot imitation.
☆ Bitwidth-Specific Logarithmic Arithmetic for Future Hardware-Accelerated Training
While advancements in quantization have significantly reduced the computational costs of inference in deep learning, training still predominantly relies on complex floating-point arithmetic. Low-precision fixed-point training presents a compelling alternative. This work introduces a novel enhancement in low-precision logarithmic fixed-point training, geared towards future hardware accelerator designs. We propose incorporating bitwidth in the design of approximations to arithmetic operations. To this end, we introduce a new hardware-friendly, piece-wise linear approximation for logarithmic addition. Using simulated annealing, we optimize this approximation at different precision levels. A C++ bit-true simulation demonstrates training of VGG-11 and VGG-16 models on CIFAR-100 and TinyImageNet, respectively, using 12-bit integer arithmetic with minimal accuracy degradation compared to 32-bit floating-point training. Our hardware study reveals up to 32.5% reduction in area and 53.5% reduction in energy consumption for the proposed LNS multiply-accumulate units compared to that of linear fixed-point equivalents.
☆ The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs
The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models. In turn, this has led to investigation of CoT monitoring as a compelling method for detecting harmful behaviors such as reward hacking, under the assumption that models' reasoning processes reflect their internal decision-making. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors like sycophancy, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning -- generating plausible-sounding justifications for violating their instructions while downplaying potential harms. Beyond being an interesting property of training, we find that while motivated reasoning can be detected by most frontier reasoning models, smaller LLM judges can fail to identify a portion of it, and in rare cases can themselves be persuaded that the reasoning is correct, despite it contradicting clear instructions. This capability gap raises concerns that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect. Our results underscore the need to account for motivated reasoning when relying on chain-of-thought processes for model evaluation and oversight. All code for this paper will be made available. WARNING: some examples in this paper may be upsetting.
comment: 26 pages
☆ Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
comment: 7 pages, 1 figure, FLTA 2025
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks and continues to attract significant attention. Existing RLVR methods, however, typically treat all tokens uniformly without accounting for token-level advantages. These methods primarily evaluate performance based on final answer correctness or Pass@K accuracy, and yet make claims about RL post-training leading to improved reasoning traces. This motivates our investigation into the effect of RL post-training on intermediate tokens which are not directly incentivized. To study this, we design an experimental setup using the GRPO algorithm with Qwen-2.5-0.5B model on the GSM8K dataset. We introduce trace coherence, a First-Order Logic (FOL)-based measure to capture the consistency of reasoning steps by identifying errors in the traces. We distinguish between trace validity and trace coherence, noting that the former implies logical soundness while the latter measures local coherence via lack of errors. Our results show that RL post-training overall improves trace coherence with the most significant gains on problems where the base model fails but the RL model succeeds. Surprisingly, RL enhances local coherence without necessarily producing valid or correct solutions. This highlights a crucial distinction: improved local coherence in reasoning steps does not guarantee final answer correctness. We argue that claims of improved reasoning via RL must be examined with care, as these may be based on improved trace coherence, which may not translate into fully valid mathematical proofs.
comment: 4 pages, 2 figures
☆ AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI NeurIPS 2025
Goal changes are a defining feature of real world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool augmented language model agents adapt to mid dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate several frontier models and uncover sharp contrasts obscured by traditional $\text{pass}@k$ scores: for example, GPT-4o reaches $92.2\%$ recovery on airline booking shifts while Gemini collapses to $48.6\%$, and retail tasks show near perfect parameter validity yet redundancy rates above $80\%$, revealing major inefficiencies. These findings demonstrate that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.
comment: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Multi-Turn Interactions in Large Language Models
☆ Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, the performance of DLMs on code generation tasks, which have stronger structural constraints, is significantly hampered by the critical trade-off between inference speed and output quality. We observed that accelerating the code generation process by reducing the number of sampling steps usually leads to a catastrophic collapse in performance. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs to achieve better inference speed and output quality in code generation. Specifically, Saber is motivated by two key insights in the DLM generation process: 1) it can be adaptively accelerated as more of the code context is established; 2) it requires a backtracking mechanism to reverse the generated tokens. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average improvement of 1.9% over mainstream DLM sampling methods, meanwhile achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.
☆ Beating the Winner's Curse via Inference-Aware Policy Optimization
There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. A common approach is to train a machine learning model to predict counterfactual outcomes, and then select the policy that optimizes the predicted objective value. In addition, practitioners also want confidence that the learned policy has better performance than the incumbent policy according to downstream policy evaluation. However, due to the winner's curse-an issue where the policy optimization procedure exploits prediction errors rather than finding actual improvements-predicted performance improvements are often not substantiated by downstream policy optimization. To address this challenge, we propose a novel strategy called inference-aware policy optimization, which modifies policy optimization to account for how the policy will be evaluated downstream. Specifically, it optimizes not only for the estimated objective value, but also for the chances that the policy will be statistically significantly better than the observational policy used to collect data. We mathematically characterize the Pareto frontier of policies according to the tradeoff of these two goals. Based on our characterization, we design a policy optimization algorithm that uses machine learning to predict counterfactual outcomes, and then plugs in these predictions to estimate the Pareto frontier; then, the decision-maker can select the policy that optimizes their desired tradeoff, after which policy evaluation can be performed on the test set as usual. Finally, we perform simulations to illustrate the effectiveness of our methodology.
☆ Extracting Rule-based Descriptions of Attention Features in Transformers
Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form "[Canadian city]... speaks --> English", (2) absence rules of the form "[Montreal]... speaks -/-> English," and (3) counting rules that toggle only when the count of a word exceeds a certain value or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.
comment: Our code is available at https://github.com/princeton-nlp/AttentionRules
☆ Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models NeurIPS 2025
Small Language Models (SLMs) offer compelling advantages in deployment cost and latency, but their accuracy often lags behind larger models, particularly for complex domain-specific tasks. While supervised fine-tuning can help bridge this performance gap, it requires substantial manual effort in data preparation and iterative optimization. We present PaDA-Agent (Pattern-guided Data Augmentation Agent), an evaluation-driven approach that streamlines the data augmentation process for SLMs through coordinated operations. Unlike state-of-the-art approaches that focus on model training errors only and generating error-correcting samples, PaDA-Agent discovers failure patterns from the validation data via evaluations and drafts targeted data augmentation strategies aiming to directly reduce the generalization gap. Our experimental results demonstrate significant improvements over state-of-the-art LLM-based data augmentation approaches for Llama 3.2 1B Instruct model fine-tuning.
comment: Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle
☆ Rethinking PCA Through Duality NeurIPS 2025
Motivated by the recently shown connection between self-attention and (kernel) principal component analysis (PCA), we revisit the fundamentals of PCA. Using the difference-of-convex (DC) framework, we present several novel formulations and provide new theoretical insights. In particular, we show the kernelizability and out-of-sample applicability for a PCA-like family of problems. Moreover, we uncover that simultaneous iteration, which is connected to the classical QR algorithm, is an instance of the difference-of-convex algorithm (DCA), offering an optimization perspective on this longstanding method. Further, we describe new algorithms for PCA and empirically compare them with state-of-the-art methods. Lastly, we introduce a kernelizable dual formulation for a robust variant of PCA that minimizes the $l_1$ deviation of the reconstruction errors.
comment: NeurIPS 2025 poster
☆ HyperDiffusionFields (HyDiF): Diffusion-Guided Hypernetworks for Learning Implicit Molecular Neural Fields
We introduce HyperDiffusionFields (HyDiF), a framework that models 3D molecular conformers as continuous fields rather than discrete atomic coordinates or graphs. At the core of our approach is the Molecular Directional Field (MDF), a vector field that maps any point in space to the direction of the nearest atom of a particular type. We represent MDFs using molecule-specific neural implicit fields, which we call Molecular Neural Fields (MNFs). To enable learning across molecules and facilitate generalization, we adopt an approach where a shared hypernetwork, conditioned on a molecule, generates the weights of the given molecule's MNF. To endow the model with generative capabilities, we train the hypernetwork as a denoising diffusion model, enabling sampling in the function space of molecular fields. Our design naturally extends to a masked diffusion mechanism to support structure-conditioned generation tasks, such as molecular inpainting, by selectively noising regions of the field. Beyond generation, the localized and continuous nature of MDFs enables spatially fine-grained feature extraction for molecular property prediction, something not easily achievable with graph or point cloud based methods. Furthermore, we demonstrate that our approach scales to larger biomolecules, illustrating a promising direction for field-based molecular modeling.
Efficient Long-context Language Model Training by Core Attention Disaggregation
We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
☆ Generalization Below the Edge of Stability: The Role of Data Geometry
Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparameterized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.
comment: Under Review. Comments welcome!
☆ Gradient Variance Reveals Failure Modes in Flow-Based Generative Models NeurIPS 2025
Rectified Flows learn ODE vector fields whose trajectories are straight between source and target distributions, enabling near one-step inference. We show that this straight-path objective conceals fundamental failure modes: under deterministic training, low gradient variance drives memorization of arbitrary training pairings, even when interpolant lines between pairs intersect. To analyze this mechanism, we study Gaussian-to-Gaussian transport and use the loss gradient variance across stochastic and deterministic regimes to characterize which vector fields optimization favors in each setting. We then show that, in a setting where all interpolating lines intersect, applying Rectified Flow yields the same specific pairings at inference as during training. More generally, we prove that a memorizing vector field exists even when training interpolants intersect, and that optimizing the straight-path objective converges to this ill-defined field. At inference, deterministic integration reproduces the exact training pairings. We validate our findings empirically on the CelebA dataset, confirming that deterministic interpolants induce memorization, while the injection of small noise restores generalization.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Latent Discrete Diffusion Models
We study discrete diffusion for language and other categorical data and focus on a common limitation of masked denoisers: reverse transitions typically factorize across positions, which can weaken joint structure and degrade quality in few-step generation. We propose \emph{Latent Discrete Diffusion Models} (LDDMs), which couple a masked discrete diffusion over tokens with a continuous diffusion over latent embeddings. The latent channel provides a softer signal and carries cross-token dependencies that help resolve ambiguities. We present two instantiations: (i) FUJI-LDDMs, which perform fully joint denoising of tokens and latents, and (ii) SEQ-LDDMs, which sequentially resolve the latent and then the discrete chain conditionally on it. For both variants we derive ELBO-style objectives and discuss design choices to learn informative latents yet amenable to diffusoin modeling. In experiments, LDDMs yield improvements on unconditional generation metrics as compared to state-of-the-art masked discrete diffusion baselines, and are effective at lower sampling budgets, where unmasking many tokens per step is desirable.
☆ PrivaDE: Privacy-preserving Data Evaluation for Blockchain-based Data Marketplaces
Evaluating the relevance of data is a critical task for model builders seeking to acquire datasets that enhance model performance. Ideally, such evaluation should allow the model builder to assess the utility of candidate data without exposing proprietary details of the model. At the same time, data providers must be assured that no information about their data - beyond the computed utility score - is disclosed to the model builder. In this paper, we present PrivaDE, a cryptographic protocol for privacy-preserving utility scoring and selection of data for machine learning. While prior works have proposed data evaluation protocols, our approach advances the state of the art through a practical, blockchain-centric design. Leveraging the trustless nature of blockchains, PrivaDE enforces malicious-security guarantees and ensures strong privacy protection for both models and datasets. To achieve efficiency, we integrate several techniques - including model distillation, model splitting, and cut-and-choose zero-knowledge proofs - bringing the runtime to a practical level. Furthermore, we propose a unified utility scoring function that combines empirical loss, predictive entropy, and feature-space diversity, and that can be seamlessly integrated into active-learning workflows. Evaluation shows that PrivaDE performs data evaluation effectively, achieving online runtimes within 15 minutes even for models with millions of parameters. Our work lays the foundation for fair and automated data marketplaces in decentralized machine learning ecosystems.
☆ From AutoRecSys to AutoRecLab: A Call to Build, Evaluate, and Govern Autonomous Recommender-Systems Research Labs
Recommender-systems research has accelerated model and evaluation advances, yet largely neglects automating the research process itself. We argue for a shift from narrow AutoRecSys tools -- focused on algorithm selection and hyper-parameter tuning -- to an Autonomous Recommender-Systems Research Lab (AutoRecLab) that integrates end-to-end automation: problem ideation, literature analysis, experimental design and execution, result interpretation, manuscript drafting, and provenance logging. Drawing on recent progress in automated science (e.g., multi-agent AI Scientist and AI Co-Scientist systems), we outline an agenda for the RecSys community: (1) build open AutoRecLab prototypes that combine LLM-driven ideation and reporting with automated experimentation; (2) establish benchmarks and competitions that evaluate agents on producing reproducible RecSys findings with minimal human input; (3) create review venues for transparently AI-generated submissions; (4) define standards for attribution and reproducibility via detailed research logs and metadata; and (5) foster interdisciplinary dialogue on ethics, governance, privacy, and fairness in autonomous research. Advancing this agenda can increase research throughput, surface non-obvious insights, and position RecSys to contribute to emerging Artificial Research Intelligence. We conclude with a call to organise a community retreat to coordinate next steps and co-author guidance for the responsible integration of automated research systems.
☆ Enhancing mortality prediction in cardiac arrest ICU patients through meta-modeling of structured clinical data from MIMIC-IV
Accurate early prediction of in-hospital mortality in intensive care units (ICUs) is essential for timely clinical intervention and efficient resource allocation. This study develops and evaluates machine learning models that integrate both structured clinical data and unstructured textual information, specifically discharge summaries and radiology reports, from the MIMIC-IV database. We used LASSO and XGBoost for feature selection, followed by a multivariate logistic regression trained on the top features identified by both models. Incorporating textual features using TF-IDF and BERT embeddings significantly improved predictive performance. The final logistic regression model, which combined structured and textual input, achieved an AUC of 0.918, compared to 0.753 when using structured data alone, a relative improvement 22%. The analysis of the decision curve demonstrated a superior standardized net benefit in a wide range of threshold probabilities (0.2-0.8), confirming the clinical utility of the model. These results underscore the added prognostic value of unstructured clinical notes and support their integration into interpretable feature-driven risk prediction models for ICU patients.
comment: 38 pages, 5 figures, 2 tables, 3 appendices
☆ Accelerating Vision Transformers with Adaptive Patch Sizes
Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30\% faster training and inference in visual QA, object detection, and semantic segmentation.
comment: Project page at https://rccchoudhury.github.io/apt/
☆ Provably Optimal Reinforcement Learning under Safety Filtering
Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. Additionally, we define an associated filtered MDP in which all actions result in safe effects, thanks to a safety filter that is considered to be a part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is safe categorically, (ii) standard RL convergence carries over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP-when executed through the same filter-achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter that is available.
comment: 17 pages, 3 figures
☆ Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
☆ MEG-GPT: A transformer-based foundation model for magnetoencephalography data
Modelling the complex spatiotemporal patterns of large-scale brain dynamics is crucial for neuroscience, but traditional methods fail to capture the rich structure in modalities such as magnetoencephalography (MEG). Recent advances in deep learning have enabled significant progress in other domains, such as language and vision, by using foundation models at scale. Here, we introduce MEG-GPT, a transformer based foundation model that uses time-attention and next time-point prediction. To facilitate this, we also introduce a novel data-driven tokeniser for continuous MEG data, which preserves the high temporal resolution of continuous MEG signals without lossy transformations. We trained MEG-GPT on tokenised brain region time-courses extracted from a large-scale MEG dataset (N=612, eyes-closed rest, Cam-CAN data), and show that the learnt model can generate data with realistic spatio-spectral properties, including transient events and population variability. Critically, it performs well in downstream decoding tasks, improving downstream supervised prediction task, showing improved zero-shot generalisation across sessions (improving accuracy from 0.54 to 0.59) and subjects (improving accuracy from 0.41 to 0.49) compared to a baseline methods. Furthermore, we show the model can be efficiently fine-tuned on a smaller labelled dataset to boost performance in cross-subject decoding scenarios. This work establishes a powerful foundation model for electrophysiological data, paving the way for applications in computational neuroscience and neural decoding.
☆ Batch Distillation Data for Developing Machine Learning Anomaly Detection Methods
Machine learning (ML) holds great potential to advance anomaly detection (AD) in chemical processes. However, the development of ML-based methods is hindered by the lack of openly available experimental data. To address this gap, we have set up a laboratory-scale batch distillation plant and operated it to generate an extensive experimental database, covering fault-free experiments and experiments in which anomalies were intentionally induced, for training advanced ML-based AD methods. In total, 119 experiments were conducted across a wide range of operating conditions and mixtures. Most experiments containing anomalies were paired with a corresponding fault-free one. The database that we provide here includes time-series data from numerous sensors and actuators, along with estimates of measurement uncertainty. In addition, unconventional data sources -- such as concentration profiles obtained via online benchtop NMR spectroscopy and video and audio recordings -- are provided. Extensive metadata and expert annotations of all experiments are included. The anomaly annotations are based on an ontology developed in this work. The data are organized in a structured database and made freely available via doi.org/10.5281/zenodo.17395544. This new database paves the way for the development of advanced ML-based AD methods. As it includes information on the causes of anomalies, it further enables the development of interpretable and explainable ML approaches, as well as methods for anomaly mitigation.
☆ R2L: Reliable Reinforcement Learning: Guaranteed Return & Reliable Policies in Reinforcement Learning
In this work, we address the problem of determining reliable policies in reinforcement learning (RL), with a focus on optimization under uncertainty and the need for performance guarantees. While classical RL algorithms aim at maximizing the expected return, many real-world applications - such as routing, resource allocation, or sequential decision-making under risk - require strategies that ensure not only high average performance but also a guaranteed probability of success. To this end, we propose a novel formulation in which the objective is to maximize the probability that the cumulative return exceeds a prescribed threshold. We demonstrate that this reliable RL problem can be reformulated, via a state-augmented representation, into a standard RL problem, thereby allowing the use of existing RL and deep RL algorithms without the need for entirely new algorithmic frameworks. Theoretical results establish the equivalence of the two formulations and show that reliable strategies can be derived by appropriately adapting well-known methods such as Q-learning or Dueling Double DQN. To illustrate the practical relevance of the approach, we consider the problem of reliable routing, where the goal is not to minimize the expected travel time but rather to maximize the probability of reaching the destination within a given time budget. Numerical experiments confirm that the proposed formulation leads to policies that effectively balance efficiency and reliability, highlighting the potential of reliable RL for applications in stochastic and safety-critical environments.
comment: 27 pages
☆ Fine-tuning Flow Matching Generative Models with Intermediate Feedback
Flow-based generative models have shown remarkable success in text-to-image generation, yet fine-tuning them with intermediate feedback remains challenging, especially for continuous-time flow matching models. Most existing approaches solely learn from outcome rewards, struggling with the credit assignment problem. Alternative methods that attempt to learn a critic via direct regression on cumulative rewards often face training instabilities and model collapse in online settings. We present AC-Flow, a robust actor-critic framework that addresses these challenges through three key innovations: (1) reward shaping that provides well-normalized learning signals to enable stable intermediate value learning and gradient control, (2) a novel dual-stability mechanism that combines advantage clipping to prevent destructive policy updates with a warm-up phase that allows the critic to mature before influencing the actor, and (3) a scalable generalized critic weighting scheme that extends traditional reward-weighted methods while preserving model diversity through Wasserstein regularization. Through extensive experiments on Stable Diffusion 3, we demonstrate that AC-Flow achieves state-of-the-art performance in text-to-image alignment tasks and generalization to unseen human preference models. Our results demonstrate that even with a computationally efficient critic model, we can robustly finetune flow models without compromising generative quality, diversity, or stability.
☆ Arbitrated Indirect Treatment Comparisons
Matching-adjusted indirect comparison (MAIC) has been increasingly employed in health technology assessments (HTA). By reweighting subjects from a trial with individual participant data (IPD) to match the covariate summary statistics of another trial with only aggregate data (AgD), MAIC facilitates the estimation of a treatment effect defined with respect to the AgD trial population. This manuscript introduces a new class of methods, termed arbitrated indirect treatment comparisons, designed to address the ``MAIC paradox'' -- a phenomenon highlighted by Jiang et al.~(2025). The MAIC paradox arises when different sponsors, analyzing the same data, reach conflicting conclusions regarding which treatment is more effective. The underlying issue is that each sponsor implicitly targets a different population. To resolve this inconsistency, the proposed methods focus on estimating treatment effects in a common target population, specifically chosen to be the overlap population.
☆ SPACeR: Self-Play Anchoring with Centralized Reference Models
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and KL divergence, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
comment: Project page: https://spacer-ai.github.io/
Fast Agnostic Learners in the Plane
We investigate the computational efficiency of agnostic learning for several fundamental geometric concept classes in the plane. While the sample complexity of agnostic learning is well understood, its time complexity has received much less attention. We study the class of triangles and, more generally, the class of convex polygons with $k$ vertices for small $k$, as well as the class of convex sets in a square. We present a proper agnostic learner for the class of triangles that has optimal sample complexity and runs in time $\tilde O({\epsilon^{-6}})$, improving on the algorithm of Dobkin and Gunopulos (COLT `95) that runs in time $\tilde O({\epsilon^{-10}})$. For 4-gons and 5-gons, we improve the running time from $O({\epsilon^{-12}})$, achieved by Fischer and Kwek (eCOLT `96), to $\tilde O({\epsilon^{-8}})$ and $\tilde O({\epsilon^{-10}})$, respectively. We also design a proper agnostic learner for convex sets under the uniform distribution over a square with running time $\tilde O({\epsilon^{-5}})$, improving on the previous $\tilde O(\epsilon^{-8})$ bound at the cost of slightly higher sample complexity. Notably, agnostic learning of convex sets in $[0,1]^2$ under general distributions is impossible because this concept class has infinite VC-dimension. Our agnostic learners use data structures and algorithms from computational geometry and their analysis relies on tools from geometry and probabilistic combinatorics. Because our learners are proper, they yield tolerant property testers with matching running times. Our results raise a fundamental question of whether a gap between the sample and time complexity is inherent for agnostic learning of these and other natural concept classes.
☆ Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models
Balancing exploration and exploitation during reinforcement learning fine-tuning of generative models presents a critical challenge, as existing approaches rely on fixed divergence regularization that creates an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. We introduce Adaptive Divergence Regularized Policy Optimization (ADRPO), which automatically adjusts regularization strength based on advantage estimates-reducing regularization for high-value samples while applying stronger regularization to poor samples, enabling policies to navigate between exploration and aggressive exploitation according to data quality. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation, achieving better semantic alignment and diversity than offline methods like DPO and online methods with fixed regularization like ORW-CFM-W2. ADRPO enables a 2B parameter SD3 model to surpass much larger models with 4.8B and 12B parameters in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models, enhancing existing online RL methods like GRPO. In LLM fine-tuning, ADRPO demonstrates an emergent ability to escape local optima through active exploration, while in multi-modal audio reasoning, it outperforms GRPO through superior step-by-step reasoning, enabling a 7B model to outperform substantially larger commercial models including Gemini 2.5 Pro and GPT-4o Audio, offering an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
comment: 30 pages
☆ Cross-Domain Long-Term Forecasting: Radiation Dose from Sparse Neutron Sensor via Spatio-Temporal Operator Network
Forecasting unobservable physical quantities from sparse, cross-domain sensor data is a central unsolved problem in scientific machine learning. Existing neural operators and large-scale forecasters rely on dense, co-located input-output fields and short temporal contexts, assumptions that fail in real-world systems where sensing and prediction occur on distinct physical manifolds and over long timescales. We introduce the Spatio-Temporal Operator Network (STONe), a non-autoregressive neural operator that learns a stable functional mapping between heterogeneous domains. By directly inferring high-altitude radiation dose fields from sparse ground-based neutron measurements, STONe demonstrates that operator learning can generalize beyond shared-domain settings. It defines a nonlinear operator between sensor and target manifolds that remains stable over long forecasting horizons without iterative recurrence. This challenges the conventional view that operator learning requires domain alignment or autoregressive propagation. Trained on 23 years of global neutron data, STONe achieves accurate 180-day forecasts with millisecond inference latency. The framework establishes a general principle for cross-domain operator inference, enabling real-time prediction of complex spatiotemporal fields in physics, climate, and energy systems.
☆ TriggerNet: A Novel Explainable AI Framework for Red Palm Mite Detection and Multi-Model Comparison and Heuristic-Guided Annotation
The red palm mite infestation has become a serious concern, particularly in regions with extensive palm cultivation, leading to reduced productivity and economic losses. Accurate and early identification of mite-infested plants is critical for effective management. The current study focuses on evaluating and comparing the ML model for classifying the affected plants and detecting the infestation. TriggerNet is a novel interpretable AI framework that integrates Grad-CAM, RISE, FullGrad, and TCAV to generate novel visual explanations for deep learning models in plant classification and disease detection. This study applies TriggerNet to address red palm mite (Raoiella indica) infestation, a major threat to palm cultivation and agricultural productivity. A diverse set of RGB images across 11 plant species, Arecanut, Date Palm, Bird of Paradise, Coconut Palm, Ginger, Citrus Tree, Palm Oil, Orchid, Banana Palm, Avocado Tree, and Cast Iron Plant was utilized for training and evaluation. Advanced deep learning models like CNN, EfficientNet, MobileNet, ViT, ResNet50, and InceptionV3, alongside machine learning classifiers such as Random Forest, SVM, and KNN, were employed for plant classification. For disease classification, all plants were categorized into four classes: Healthy, Yellow Spots, Reddish Bronzing, and Silk Webbing. Snorkel was used to efficiently label these disease classes by leveraging heuristic rules and patterns, reducing manual annotation time and improving dataset reliability.
comment: 17 pages, 9 figures
☆ Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity NeurIPS 2025
Neural activity forecasting is central to understanding neural systems and enabling closed-loop control. While deep learning has recently advanced the state-of-the-art in the time series forecasting literature, its application to neural activity forecasting remains limited. To bridge this gap, we systematically evaluated eight probabilistic deep learning models, including two foundation models, that have demonstrated strong performance on general forecasting benchmarks. We compared them against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging. Across prediction horizons, several deep learning models consistently outperformed classical approaches, with the best model producing informative forecasts up to 1.5 seconds into the future. Our findings point toward future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Data on the Brain & Mind
☆ Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8MB memory budget and 21-23ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
☆ From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models
Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP-Global Iterative Structured Pruning-a post-training method that removes attention heads and MLP channels using first-order, loss-based important weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
comment: 16 pages, 4 figures
☆ ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues
Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43\% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available on https://github.com/prateek-gothwal/ViBED-Net .
comment: 10 pages, 4 figures, 2 tables
Attention-Guided Deep Adversarial Temporal Subspace Clustering (A-DATSC) Model for multivariate spatiotemporal data ICLR 2025
Deep subspace clustering models are vital for applications such as snowmelt detection, sea ice tracking, crop health monitoring, infectious disease modeling, network load prediction, and land-use planning, where multivariate spatiotemporal data exhibit complex temporal dependencies and reside on multiple nonlinear manifolds beyond the capability of traditional clustering methods. These models project data into a latent space where samples lie in linear subspaces and exploit the self-expressiveness property to uncover intrinsic relationships. Despite their success, existing methods face major limitations: they use shallow autoencoders that ignore clustering errors, emphasize global features while neglecting local structure, fail to model long-range dependencies and positional information, and are rarely applied to 4D spatiotemporal data. To address these issues, we propose A-DATSC (Attention-Guided Deep Adversarial Temporal Subspace Clustering), a model combining a deep subspace clustering generator and a quality-verifying discriminator. The generator, inspired by U-Net, preserves spatial and temporal integrity through stacked TimeDistributed ConvLSTM2D layers, reducing parameters and enhancing generalization. A graph attention transformer based self-expressive network captures local spatial relationships, global dependencies, and both short- and long-range correlations. Experiments on three real-world multivariate spatiotemporal datasets show that A-DATSC achieves substantially superior clustering performance compared to state-of-the-art deep subspace clustering models.
comment: 9 pages, under review submitted to ICLR 2025
☆ SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone
Modern language models are evaluated on large benchmarks, which are difficult to make sense of, especially for model selection. Looking at the raw evaluation numbers themselves using a model-centric lens, we propose SimBA, a three phase framework to Simplify Benchmark Analysis. The three phases of SimBA are: stalk, where we conduct dataset & model comparisons, prowl, where we discover a representative subset, and pounce, where we use the representative subset to predict performance on a held-out set of models. Applying SimBA to three popular LM benchmarks: HELM, MMLU, and BigBenchLite reveals that across all three benchmarks, datasets and models relate strongly to one another (stalk). We develop an representative set discovery algorithm which covers a benchmark using raw evaluation scores alone. Using our algorithm, we find that with 6.25% (1/16), 1.7% (1/58), and 28.4% (21/74) of the datasets for HELM, MMLU, and BigBenchLite respectively, we achieve coverage levels of at least 95% (prowl). Additionally, using just these representative subsets, we can both preserve model ranks and predict performance on a held-out set of models with near zero mean-squared error (pounce). Taken together, SimBA can help model developers improve efficiency during model training and dataset creators validate whether their newly created dataset differs from existing datasets in a benchmark. Our code is open source, available at https://github.com/nishantsubramani/simba.
comment: EMNLP 2025 Findings
☆ Demystifying Transition Matching: When and Why It Can Beat Flow Matching
Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
☆ QINNs: Quantum-Informed Neural Networks
Classical deep neural networks can learn rich multi-particle correlations in collider data, but their inductive biases are rarely anchored in physics structure. We propose quantum-informed neural networks (QINNs), a general framework that brings quantum information concepts and quantum observables into purely classical models. While the framework is broad, in this paper, we study one concrete realisation that encodes each particle as a qubit and uses the Quantum Fisher Information Matrix (QFIM) as a compact, basis-independent summary of particle correlations. Using jet tagging as a case study, QFIMs act as lightweight embeddings in graph neural networks, increasing model expressivity and plasticity. The QFIM reveals distinct patterns for QCD and hadronic top jets that align with physical expectations. Thus, QINNs offer a practical, interpretable, and scalable route to quantum-informed analyses, that is, tomography, of particle collisions, particularly by enhancing well-established deep learning approaches.
comment: 20 pages, 9 figures
☆ Universal Spectral Tokenization via Self-Supervised Panchromatic Representation Learning NeurIPS 2025
Sequential scientific data span many resolutions and domains, and unifying them into a common representation is a key step toward developing foundation models for the sciences. Astronomical spectra exemplify this challenge: massive surveys have collected millions of spectra across a wide range of wavelengths and resolutions, yet analyses remain fragmented across spectral domains (e.g., optical vs. infrared) and object types (e.g., stars vs. galaxies), limiting the ability to pool information across datasets. We present a deep learning model that jointly learns from heterogeneous spectra in a self-supervised manner. Our universal spectral tokenizer processes spectra from a variety of object types and resolutions directly on their native wavelength grids, producing intrinsically aligned, homogeneous, and physically meaningful representations that can be efficiently adapted to achieve competitive performance across a range of downstream tasks. For the first time, we demonstrate that a single model can unify spectral data across resolutions and domains, suggesting that our model can serve as a powerful building block for foundation models in astronomy -- and potentially extend to other scientific domains with heterogeneous sequential data, such as climate and healthcare.
comment: Accepted at NeurIPS 2025 Machine Learning and the Physical Sciences Workshop
☆ OmniCast: A Masked Latent Diffusion Model for Weather Forecasting Across Time Scales NeurIPS 2025
Accurate weather forecasting across time scales is critical for anticipating and mitigating the impacts of climate change. Recent data-driven methods based on deep learning have achieved significant success in the medium range, but struggle at longer subseasonal-to-seasonal (S2S) horizons due to error accumulation in their autoregressive approach. In this work, we propose OmniCast, a scalable and skillful probabilistic model that unifies weather forecasting across timescales. OmniCast consists of two components: a VAE model that encodes raw weather data into a continuous, lower-dimensional latent space, and a diffusion-based transformer model that generates a sequence of future latent tokens given the initial conditioning tokens. During training, we mask random future tokens and train the transformer to estimate their distribution given conditioning and visible tokens using a per-token diffusion head. During inference, the transformer generates the full sequence of future tokens by iteratively unmasking random subsets of tokens. This joint sampling across space and time mitigates compounding errors from autoregressive approaches. The low-dimensional latent space enables modeling long sequences of future latent states, allowing the transformer to learn weather dynamics beyond initial conditions. OmniCast performs competitively with leading probabilistic methods at the medium-range timescale while being 10x to 20x faster, and achieves state-of-the-art performance at the subseasonal-to-seasonal scale across accuracy, physics-based, and probabilistic metrics. Furthermore, we demonstrate that OmniCast can generate stable rollouts up to 100 years ahead. Code and model checkpoints are available at https://github.com/tung-nd/omnicast.
comment: Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Challenges and proposed solutions in modeling multimodal data: A systematic review
Multimodal data modeling has emerged as a powerful approach in clinical research, enabling the integration of diverse data types such as imaging, genomics, wearable sensors, and electronic health records. Despite its potential to improve diagnostic accuracy and support personalized care, modeling such heterogeneous data presents significant technical challenges. This systematic review synthesizes findings from 69 studies to identify common obstacles, including missing modalities, limited sample sizes, dimensionality imbalance, interpretability issues, and finding the optimal fusion techniques. We highlight recent methodological advances, such as transfer learning, generative models, attention mechanisms, and neural architecture search that offer promising solutions. By mapping current trends and innovations, this review provides a comprehensive overview of the field and offers practical insights to guide future research and development in multimodal modeling for medical applications.
♻ ☆ Denoising the Future: Top-p Distributions for Moving Through Time
Inference in dynamic probabilistic models is a complex task involving expensive operations. In particular, for Hidden Markov Models, the whole state space has to be enumerated for advancing in time. Even states with negligible probabilities are considered, resulting in computational inefficiency and increased noise due to the propagation of unlikely probability mass. We propose to denoise the future and speed up inference by using only the top-p states, i.e., the most probable states with accumulated probability p. We show that the error introduced by using only the top-p states is bound by p and the so-called minimal mixing rate of the underlying model. Moreover, in our empirical evaluation, we show that we can expect speedups of at least an order of magnitude, while the error in terms of total variation distance is below 0.09.
comment: Accepted at ECSQARU 2025
♻ ☆ Intrinsic Dimensionality of Fermi-Pasta-Ulam-Tsingou High-Dimensional Trajectories Through Manifold Learning: A Linear Approach
A data-driven approach based on unsupervised machine learning is proposed to infer the intrinsic dimension $m^{\ast}$ of the high-dimensional trajectories of the Fermi-Pasta-Ulam-Tsingou (FPUT) model. Principal component analysis (PCA) is applied to trajectory data consisting of $n_s = 4,000,000$ datapoints, of the FPUT $\beta$ model with $N = 32$ coupled oscillators, revealing a critical relationship between $m^{\ast}$ and the model's nonlinear strength. By estimating the intrinsic dimension $m^{\ast}$ using multiple methods (participation ratio, Kaiser rule, and the Kneedle algorithm), it is found that $m^{\ast}$ increases with the model nonlinearity. Interestingly, in the weakly nonlinear regime, for trajectories initialized by exciting the first mode, the participation ratio estimates $m^{\ast} = 2, 3$, strongly suggesting that quasi-periodic motion on a low-dimensional Riemannian manifold underlies the characteristic energy recurrences observed in the FPUT model.
comment: 14 pages, 15 figures. This version matches the article published in Chaos 35, 103118 (2025)
♻ ☆ Score-based deterministic density sampling
We propose a deterministic sampling framework using Score-Based Transport Modeling for sampling an unnormalized target density $\pi$ given only its score $\nabla \log \pi$. Our method approximates the Wasserstein gradient flow on $\mathrm{KL}(f_t\|\pi)$ by learning the time-varying score $\nabla \log f_t$ on the fly using score matching. While having the same marginal distribution as Langevin dynamics, our method produces smooth deterministic trajectories, resulting in monotone noise-free convergence. We prove that our method dissipates relative entropy at the same rate as the exact gradient flow, provided sufficient training. Numerical experiments validate our theoretical findings: our method converges at the optimal rate, has smooth trajectories, and is often more sample efficient than its stochastic counterpart. Experiments on high-dimensional image data show that our method produces high-quality generations in as few as 15 steps and exhibits natural exploratory behavior. The memory and runtime scale linearly in the sample size.
comment: 13 pages, 2 tables, 11 figures. Key words: Deterministic sampling; score-based transport modeling; Wasserstein gradient flow; relative entropy; Fisher information; annealing; neural network; neural tangent kernel
♻ ☆ Weak-to-Strong Generalization Even in Random Feature Networks, Provably
Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e. random features), is trained on the population, and a "strong" student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
♻ ☆ REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards NeurIPS 2025
We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.
comment: NeurIPS 2025 Spotlight. For code, see https://github.com/open-thought/reasoning-gym
♻ ☆ Speech Foundation Models Generalize to Time Series Tasks from Wearable Sensor Data
Both speech and sensor time series data encode information in both the time- and frequency- domains, like spectral powers and waveform shapelets. We show that speech foundation models learn representations that generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform those extracted from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find that the convolutional feature encoders of speech models are particularly relevant for wearable sensor applications. The proposed approach enhances performance on data-scarce time-series tasks using simple probing methods. This work takes a step toward developing generalized time-series models that unify speech and sensor modalities.
comment: Preprint, under review
♻ ☆ Auto-Prompt Generation is Not Robust: Prompt Optimization Driven by Pseudo Gradient
While automatic prompt generation methods have recently received significant attention, their robustness remains poorly understood. In this paper, we introduce PertBench, a comprehensive benchmark dataset that includes a wide range of input perturbations, designed to systematically evaluate the robustness of current auto-prompting techniques. Our analysis reveals substantial vulnerabilities in existing prompt generation strategies, where even minor modifications to the prompt can lead to significant differences in model output. To address this issue, we propose PGO, a gradient-free prompt generation framework that leverages perturbation types as pseudo-gradient signals to guide LLMs in producing more robust prompts. In contrast to existing methods that assess prompt quality only on clean, well-structured inputs, our approach explicitly emphasizes robustness under noisy and perturbed conditions. Extensive experiments across diverse tasks and multiple LLMs show PGO consistently outperforms previous methods in maintaining performance under input perturbations.
♻ ☆ Navigating the Latent Space Dynamics of Neural Models
Neural networks transform high-dimensional data into compact, structured representations, often modeled as elements of a lower dimensional latent space. In this paper, we present an alternative interpretation of neural models as dynamical systems acting on the latent manifold. Specifically, we show that autoencoder models implicitly define a latent vector field on the manifold, derived by iteratively applying the encoding-decoding map, without any additional training. We observe that standard training procedures introduce inductive biases that lead to the emergence of attractor points within this vector field. Drawing on this insight, we propose to leverage the vector field as a representation for the network, providing a novel tool to analyze the properties of the model and the data. This representation enables to: (i) analyze the generalization and memorization regimes of neural models, even throughout training; (ii) extract prior knowledge encoded in the network's parameters from the attractors, without requiring any input data; (iii) identify out-of-distribution samples from their trajectories in the vector field. We further validate our approach on vision foundation models, showcasing the applicability and effectiveness of our method in real-world scenarios.
♻ ☆ LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching NeurIPS 2025
The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model's prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.
comment: Accepted as a poster presentation at NeurIPS 2025. Camera-ready version. 10 pages, 7 figures
♻ ☆ Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization
Self-Refinement refers to a model's ability to revise its own responses to produce improved outputs. This capability can also serve as a fundamental mechanism for Self-Improvement, for example, by reconstructing datasets with refined results to enhance intrinsic model performance. However, our comprehensive experiments reveal that large language models (LLMs) show no clear evidence of inherent Self-Refinement and may even experience response quality degradation after Self-Refinement. To address this issue, we propose EVOLVE, a simple and effective framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We first explore optimization methods during training to activate the model's Self-Refinement capability. Then, at inference, we investigate various generation strategies to further enhance and utilize Self-Refinement while supplying the necessary data for training. Through synergistic optimization of training and inference stages, we continually evolve the model's Self-Refinement ability, enabling it to better refine its own responses. Moreover, we demonstrate the potential of leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic model abilities. Experiments show that the evolved Self-Refinement ability enables the Llama-3.1-8B base model to surpass GPT-4o, achieving 62.3% length-controlled and 63.3% raw win rates on AlpacaEval 2, and 50.3% on Arena-Hard. It also generalizes effectively to out-of-domain reasoning tasks, improving performance on mathematical reasoning benchmarks such as GSM8K and MATH.
♻ ☆ CLIMB: Class-imbalanced Learning Benchmark on Tabular Data NeurIPS 2025
Class-imbalanced learning (CIL) on tabular data is important in many real-world applications where the minority class holds the critical but rare outcomes. In this paper, we present CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data. CLIMB includes 73 real-world datasets across diverse domains and imbalance levels, along with unified implementations of 29 representative CIL algorithms. Built on a high-quality open-source Python package with unified API designs, detailed documentation, and rigorous code quality controls, CLIMB supports easy implementation and comparison between different CIL algorithms. Through extensive experiments, we provide practical insights on method accuracy and efficiency, highlighting the limitations of naive rebalancing, the effectiveness of ensembles, and the importance of data quality. Our code, documentation, and examples are available at https://github.com/ZhiningLiu1998/imbalanced-ensemble.
comment: NeurIPS 2025, Dataset and Benchmark Track. 18 pages, 7 figures, 8 tables
♻ ☆ Absolute abstraction: a renormalisation group approach
Abstraction is the process of extracting the essential features from raw data while ignoring irrelevant details. It is well known that abstraction emerges with depth in neural networks, where deep layers capture abstract characteristics of data by combining lower level features encoded in shallow layers (e.g. edges). Yet we argue that depth alone is not enough to develop truly abstract representations. We advocate that the level of abstraction crucially depends on how broad the training set is. We address the issue within a renormalisation group approach where a representation is expanded to encompass a broader set of data. We take the unique fixed point of this transformation -- the Hierarchical Feature Model -- as a candidate for a representation which is absolutely abstract. This theoretical picture is tested in numerical experiments based on Deep Belief Networks and auto-encoders trained on data of different breadth. These show that representations in neural networks approach the Hierarchical Feature Model as the data get broader and as depth increases, in agreement with theoretical predictions.
comment: 36 pages, 7 figures
♻ ☆ Invertible ResNets for Inverse Imaging Problems: Competitive Performance with Provable Regularization Properties
Learning-based methods have demonstrated remarkable performance in solving inverse problems, particularly in image reconstruction tasks. Despite their success, these approaches often lack theoretical guarantees, which are crucial in sensitive applications such as medical imaging. Recent works by Arndt et al addressed this gap by analyzing a data-driven reconstruction method based on invertible residual networks (iResNets). They revealed that, under reasonable assumptions, this approach constitutes a convergent regularization scheme. However, the performance of the reconstruction method was only validated on academic toy problems and small-scale iResNet architectures. In this work, we address this gap by evaluating the performance of iResNets on two real-world imaging tasks: a linear blurring operator and a nonlinear diffusion operator. To do so, we compare the performance of iResNets against state-of-the-art neural networks, revealing their competitiveness at the expense of longer training times. Moreover, we numerically demonstrate the advantages of the iResNet's inherent stability and invertibility by showcasing increased robustness across various scenarios as well as interpretability of the learned operator, thereby reducing the black-box nature of the reconstruction scheme.
♻ ☆ PICT -- A Differentiable, GPU-Accelerated Multi-Block PISO Solver for Simulation-Coupled Learning Tasks in Fluid Dynamics
Despite decades of advancements, the simulation of fluids remains one of the most challenging areas of in scientific computing. Supported by the necessity of gradient information in deep learning, differentiable simulators have emerged as an effective tool for optimization and learning in physics simulations. In this work, we present our fluid simulator PICT, a differentiable pressure-implicit solver coded in PyTorch with Graphics-processing-unit (GPU) support. We first verify the accuracy of both the forward simulation and our derived gradients in various established benchmarks like lid-driven cavities and turbulent channel flows before we show that the gradients provided by our solver can be used to learn complicated turbulence models in 2D and 3D. We apply both supervised and unsupervised training regimes using physical priors to match flow statistics. In particular, we learn a stable sub-grid scale (SGS) model for a 3D turbulent channel flow purely based on reference statistics. The low-resolution corrector trained with our solver runs substantially faster than the highly resolved references, while keeping or even surpassing their accuracy. Finally, we give additional insights into the physical interpretation of different solver gradients, and motivate a physically informed regularization technique. To ensure that the full potential of PICT can be leveraged, it is published as open source: https://github.com/tum-pbs/PICT.
comment: Source code at https://github.com/tum-pbs/PICT
♻ ☆ Learning Counterfactual Distributions via Kernel Nearest Neighbors
Consider a setting with multiple units (e.g., individuals, cohorts, geographic locations) and outcomes (e.g., treatments, times, items), where the goal is to learn a multivariate distribution for each unit-outcome entry, such as the distribution of a user's weekly spend and engagement under a specific mobile app version. A common challenge is the prevalence of missing not at random data, where observations are available only for certain unit-outcome combinations and the observation availability can be correlated with the properties of distributions themselves, i.e., there is unobserved confounding. An additional challenge is that for any observed unit-outcome entry, we only have a finite number of samples from the underlying distribution. We tackle these two challenges by casting the problem into a novel distributional matrix completion framework and introduce a kernel based distributional generalization of nearest neighbors to estimate the underlying distributions. By leveraging maximum mean discrepancies and a suitable factor model on the kernel mean embeddings of the underlying distributions, we establish consistent recovery of the underlying distributions even when data is missing not at random and positivity constraints are violated. Furthermore, we demonstrate that our nearest neighbors approach is robust to heteroscedastic noise, provided we have access to two or more measurements for the observed unit-outcome entries, a robustness not present in prior works on nearest neighbors with single measurements.
comment: 44 pages, 10 figures
♻ ☆ A Generic Framework for Conformal Fairness ICLR 2025
Conformal Prediction (CP) is a popular method for uncertainty quantification with machine learning models. While conformal prediction provides probabilistic guarantees regarding the coverage of the true label, these guarantees are agnostic to the presence of sensitive attributes within the dataset. In this work, we formalize \textit{Conformal Fairness}, a notion of fairness using conformal predictors, and provide a theoretically well-founded algorithm and associated framework to control for the gaps in coverage between different sensitive groups. Our framework leverages the exchangeability assumption (implicit to CP) rather than the typical IID assumption, allowing us to apply the notion of Conformal Fairness to data types and tasks that are not IID, such as graph data. Experiments were conducted on graph and tabular datasets to demonstrate that the algorithm can control fairness-related gaps in addition to coverage aligned with theoretical expectations.
comment: ICLR 2025 Camera Ready Version
♻ ☆ SAFES: Sequential Privacy and Fairness Enhancing Data Synthesis for Responsible AI
As data-driven and AI-based decision making gains widespread adoption across disciplines, it is crucial that both data privacy and decision fairness are appropriately addressed. Although differential privacy (DP) provides a robust framework for guaranteeing privacy and methods are available to improve fairness, most prior work treats the two concerns separately. Even though there are existing approaches that consider privacy and fairness simultaneously, they typically focus on a single specific learning task, limiting their generalizability. In response, we introduce SAFES, a Sequential PrivAcy and Fairness Enhancing data Synthesis procedure that sequentially combines DP data synthesis with a fairness-aware data preprocessing step. SAFES allows users flexibility in navigating the privacy-fairness-utility trade-offs. We illustrate SAFES with different DP synthesizers and fairness-aware data preprocessing methods and run extensive experiments on multiple real datasets to examine the privacy-fairness-utility trade-offs of synthetic data generated by SAFES. Empirical evaluations demonstrate that for reasonable privacy loss, SAFES-generated synthetic data can achieve significantly improved fairness metrics with relatively low utility loss.
♻ ☆ Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025
The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year's program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of "Explainability, Interpretability, and Transparency," "Uncertainty, Bias, and Fairness," "Causality," "Domain Adaptation," "Foundation Models," "Learning from Small Medical Data," "Multimodal Methods," and "Scalable, Translational Healthcare Solutions."
♻ ☆ Time-Varying Bayesian Optimization Without a Metronome
Time-Varying Bayesian Optimization (TVBO) is the go-to framework for optimizing a time-varying, expensive, noisy black-box function $f$. However, most of the asymptotic guarantees offered by TVBO algorithms rely on the assumption that observations are acquired at a constant frequency. As the GP inference complexity scales with the cube of its dataset size, this assumption is unrealistic in the long run. In this paper, we relax this assumption and derive the first upper regret bound that explicitly accounts for changes in the observations sampling frequency. Based on this analysis, we formulate practical recommendations about dataset sizes and stale data policies of TVBO algorithms. We illustrate how an algorithm (BOLT) that follows these recommendations performs better than the state-of-the-art of TVBO through experiments on synthetic and real-world problems.
♻ ☆ Solving Oscillator Ordinary Differential Equations in the Time Domain with High Performance via Soft-constrained Physics-informed Neural Network with Small Data
In many scientific and engineering (e.g., physical, biochemical, medical) practices, data generated through expensive experiments or large-scale simulations, are often sparse and noisy. Physics-informed neural network (PINN) incorporates physical information and knowledge into network topology or computational processes as model priors, with the unique advantage of achieving strong generalization with small data. This study aims to investigate the performance characteristics of the soft-constrained PINN method to solving typical linear and nonlinear ordinary differential equations (ODEs) such as primer, Van der Pol and Duffing oscillators, especially the effectiveness, efficiency, and robustness to noise with minimal data. It is verified that the soft-constrained PINN significantly reduces the need for labeled data. With the aid of appropriate collocation points no need to be labeled, it can predict and also extrapolate with minimal data. First-order and second-order ODEs, no matter linear or nonlinear oscillators, require only one and two training data (containing initial values) respectively, just like classical analytic or Runge-Kutta methods, and with equivalent precision and comparable efficiency (fast training in seconds for scalar ODEs). Furthermore, it can conveniently impose a physical law (e.g., conservation of energy) constraint by adding a regularization term to the total loss function, improving the performance to deal with various complexities such as nonlinearity like Duffing. The DeepXDE-based PINN implementation is light code and can be efficiently trained on both GPU and CPU platforms. The mathematical and computational framework of this alternative and feasible PINN method to ODEs, can be easily extended to PDEs, etc., and is becoming a favorable catalyst for the era of Digital Twins.
comment: 17 pages, 7 figures, 2 tables, etc
♻ ☆ Market-Driven Subset Selection for Budgeted Training
Training large language models on massive datasets is computationally expensive, yet empirical evidence suggests that substantial portions of training examples contribute minimally to final performance. Data subset selection addresses this inefficiency by identifying small, high-utility subsets under resource constraints. However, example utility is inherently multi-faceted, encompassing uncertainty, distributional rarity, and diversity signals that are heterogeneous and typically combined through ad hoc weighted sums lacking theoretical grounding. We propose a market-based framework that treats each training example as a tradeable contract and employs the Logarithmic Market Scoring Rule to aggregate multiple utility signals into coherent prices. Heterogeneous signals act as traders, a single liquidity parameter controls concentration versus smoothing, and topic-wise normalization ensures calibrated aggregation. Token budgets are handled explicitly through a price-per-token decision rule with an interpretable length-bias parameter. We establish theoretical connections to maximum-entropy aggregation and provide utility recovery guarantees under noisy but monotone signals. On GSM8K mathematical reasoning under strict 60k-token budgets, our selector achieves parity with strong single-signal baselines while exhibiting lower variance and incurring less than 0.1 GPU-hour overhead. On AGNews classification at 5-25\% retention rates, the market formulation delivers competitive accuracy with improved stability. Our framework unifies multi-signal data curation under fixed computational budgets for prompt-level reasoning and classification tasks.
comment: Retitled major revision of the same work (formerly "Market-Based Data Subset Selection -- Principled Aggregation of Multi-Criteria Example Utility"). Abstract and exposition revised; ablations added; theory clarified. Core results unchanged. Supersedes v1; please process as a replacement
♻ ☆ Asymptotic Performance of Time-Varying Bayesian Optimization
Time-Varying Bayesian Optimization (TVBO) is the go-to framework for optimizing a time-varying black-box objective function that may be noisy and expensive to evaluate, but its excellent empirical performance remains to be understood theoretically. Is it possible for the instantaneous regret of a TVBO algorithm to vanish asymptotically, and if so, when? We answer this question of great importance by providing upper bounds and algorithm-independent lower bounds for the cumulative regret of TVBO algorithms. In doing so, we provide important insights about the TVBO framework and derive sufficient conditions for a TVBO algorithm to have the no-regret property. To the best of our knowledge, our analysis is the first to cover all major classes of stationary kernel functions used in practice.
♻ ☆ Limitations of Normalization in Attention Mechanism
This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanism and motivate the need for more robust normalization and selection strategies in future attention architectures.
comment: 10 pages, 4 figures
♻ ☆ Wavy Transformer NeurIPS 2025
Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
comment: Accepted by NeurIPS 2025
♻ ☆ Identifiable Latent Bandits: Leveraging observational data for personalized decision-making
Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. In personalized medicine, for example, training a bandit from scratch for every patient is typically infeasible, as the number of trials required is much larger than the number of decision points for a single patient. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found. In this work, we propose an identifiable latent bandit framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.
comment: 29 pages, 17 figures
♻ ☆ DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection NeurIPS 2025
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluated on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
comment: 30 pages, 12 figures, 4 tables
♻ ☆ AI-Generated Video Detection via Perceptual Straightening NeurIPS 2025
The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV(Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests real-world video trajectories become more straight in neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, it is offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
comment: Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
♻ ☆ PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation NeurIPS 2025
Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications. Code and data are available at: github.com/ForeverBlue816/PhysioWave
comment: 43 pages, 17 figures, 17 tables. Accepted by NeurIPS 2025. Code and data are available at: github.com/ForeverBlue816/PhysioWave
♻ ☆ TimeEmb: A Lightweight Static-Dynamic Disentanglement Framework for Time Series Forecasting
Temporal non-stationarity, the phenomenon that time series distributions change over time, poses fundamental challenges to reliable time series forecasting. Intuitively, the complex time series can be decomposed into two factors, \ie time-invariant and time-varying components, which indicate static and dynamic patterns, respectively. Nonetheless, existing methods often conflate the time-varying and time-invariant components, and jointly learn the combined long-term patterns and short-term fluctuations, leading to suboptimal performance facing distribution shifts. To address this issue, we initiatively propose a lightweight static-dynamic decomposition framework, TimeEmb, for time series forecasting. TimeEmb innovatively separates time series into two complementary components: (1) time-invariant component, captured by a novel global embedding module that learns persistent representations across time series, and (2) time-varying component, processed by an efficient frequency-domain filtering mechanism inspired by full-spectrum analysis in signal processing. Experiments on real-world datasets demonstrate that TimeEmb outperforms state-of-the-art baselines and requires fewer computational resources. We conduct comprehensive quantitative and qualitative analyses to verify the efficacy of static-dynamic disentanglement. This lightweight framework can also improve existing time-series forecasting methods with simple integration. To ease reproducibility, the code is available at https://github.com/showmeon/TimeEmb.
♻ ☆ Quantum Reinforcement Learning Trading Agent for Sector Rotation in the Taiwan Stock Market
We propose a hybrid quantum-classical reinforcement learning framework for sector rotation in the Taiwan stock market. Our system employs Proximal Policy Optimization (PPO) as the backbone algorithm and integrates both classical architectures (LSTM, Transformer) and quantum-enhanced models (QNN, QRWKV, QASA) as policy and value networks. An automated feature engineering pipeline extracts financial indicators from capital share data to ensure consistent model input across all configurations. Empirical backtesting reveals a key finding: although quantum-enhanced models consistently achieve higher training rewards, they underperform classical models in real-world investment metrics such as cumulative return and Sharpe ratio. This discrepancy highlights a core challenge in applying reinforcement learning to financial domains -- namely, the mismatch between proxy reward signals and true investment objectives. Our analysis suggests that current reward designs may incentivize overfitting to short-term volatility rather than optimizing risk-adjusted returns. This issue is compounded by the inherent expressiveness and optimization instability of quantum circuits under Noisy Intermediate-Scale Quantum (NISQ) constraints. We discuss the implications of this reward-performance gap and propose directions for future improvement, including reward shaping, model regularization, and validation-based early stopping. Our work offers a reproducible benchmark and critical insights into the practical challenges of deploying quantum reinforcement learning in real-world finance.
♻ ☆ Test-Time Training for Speech Enhancement
This paper introduces a novel application of Test-Time Training (TTT) for Speech Enhancement, addressing the challenges posed by unpredictable noise conditions and domain shifts. This method combines a main speech enhancement task with a self-supervised auxiliary task in a Y-shaped architecture. The model dynamically adapts to new domains during inference time by optimizing the proposed self-supervised tasks like noise-augmented signal reconstruction or masked spectrogram prediction, bypassing the need for labeled data. We further introduce various TTT strategies offering a trade-off between adaptation and efficiency. Evaluations across synthetic and real-world datasets show consistent improvements across speech quality metrics, outperforming the baseline model. This work highlights the effectiveness of TTT in speech enhancement, providing insights for future research in adaptive and robust speech processing.
comment: Published in the Proceedings of Interspeech 2025
♻ ☆ Diffusion Transformers as Open-World Spatiotemporal Foundation Models NeurIPS 2025
The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in the elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format; 2) With task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines with training data. UrbanDiT sets up a new benchmark for foundation models in the urban spatio-temporal domain. Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT.
comment: Accepted by NeurIPS 2025
♻ ☆ What should a neuron aim for? Designing local objective functions based on information theory ICLR 2025
In modern deep neural networks, the learning dynamics of the individual neurons is often obscure, as the networks are trained via global optimization. Conversely, biological systems build on self-organized, local learning, achieving robustness and efficiency with limited global information. We here show how self-organization between individual artificial neurons can be achieved by designing abstract bio-inspired local learning goals. These goals are parameterized using a recent extension of information theory, Partial Information Decomposition (PID), which decomposes the information that a set of information sources holds about an outcome into unique, redundant and synergistic contributions. Our framework enables neurons to locally shape the integration of information from various input classes, i.e. feedforward, feedback, and lateral, by selecting which of the three inputs should contribute uniquely, redundantly or synergistically to the output. This selection is expressed as a weighted sum of PID terms, which, for a given problem, can be directly derived from intuitive reasoning or via numerical optimization, offering a window into understanding task-relevant local information processing. Achieving neuron-level interpretability while enabling strong performance using local learning, our work advances a principled information-theoretic foundation for local learning strategies.
comment: Presented as an oral at ICLR 2025. Conference version: https://openreview.net/forum?id=CLE09ESvul, 24 pages, 11 figures
♻ ☆ Predicting Patient Recovery or Mortality Using Deep Neural Decision Tree and Forest
Objective: Identifying patients at high risk of mortality is crucial for emergency physicians to allocate hospital resources effectively, particularly in regions with limited medical services. This need becomes even more pressing during global health crises that lead to significant morbidity and mortality. This study aimed to present the usability deep neural decision forest and deep neural decision tree to predict mortality among Coronavirus disease 2019 (COVID-19) patients. To this end, We used patient data encompassing Coronavirus disease 2019 diagnosis, demographics, health indicators, and occupational risk factors to analyze disease severity and outcomes. The dataset was partitioned using a stratified sampling method, ensuring that 80% was allocated for training and 20% for testing. Nine machine learning and deep learning methods were employed to build predictive models. The models were evaluated across all stages to determine their effectiveness in predicting patient outcomes. Results: Among the models, the deep neural decision forest consistently outperformed others. Results indicated that using only clinical data yielded an accuracy of 80% by deep neural decision forest, demonstrating it as a reliable predictor of patient mortality. Moreover, the results suggest that clinical data alone may be the most accurate diagnostic tool for predicting mortality.
♻ ☆ A Pure Hypothesis Test for Inhomogeneous Random Graph Models Based on a Kernelised Stein Discrepancy
Complex data are often represented as a graph, which in turn can often be viewed as a realisation of a random graph, such as an inhomogeneous random graph model (IRG). For general fast goodness-of-fit tests in high dimensions, kernelised Stein discrepancy (KSD) tests are a powerful tool. Here, we develop a KSD-type test for IRG models that can be carried out with a single observation of the network. The test applies to a network of any size, but is particularly interesting for small networks for which asymptotic tests are not warranted. We also provide theoretical guarantees.
comment: 49 pages, 24 figures
♻ ☆ GIST: Greedy Independent Set Thresholding for Max-Min Diversification with Submodular Utility
This work studies a novel subset selection problem called max-min diversification with monotone submodular utility ($\textsf{MDMS}$), which has a wide range of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal of $\textsf{MDMS}$ is to maximize $f(S) = g(S) + \lambda \cdot \texttt{div}(S)$ subject to a cardinality constraint $|S| \le k$, where $g(S)$ is a monotone submodular function and $\texttt{div}(S) = \min_{u,v \in S : u \ne v} \text{dist}(u,v)$ is the max-min diversity objective. We propose the $\texttt{GIST}$ algorithm, which gives a $\frac{1}{2}$-approximation guarantee for $\textsf{MDMS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove that it is NP-hard to approximate within a factor of $0.5584$. Finally, we show in our empirical study that $\texttt{GIST}$ outperforms state-of-the-art benchmarks for a single-shot data sampling task on ImageNet.
comment: 21 pages, 3 figures
♻ ☆ Parameter Efficient Fine-tuning via Explained Variance Adaptation NeurIPS 2025
Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned for a specific downstream task. The most common fine-tuning method is to update pretrained weights via low-rank adaptation (LoRA). Existing initialization strategies for LoRA often rely on singular value decompositions (SVD) of gradients or weight matrices. However, they do not provably maximize the expected gradient signal, which is critical for fast adaptation. To this end, we introduce Explained Variance Adaptation (EVA), an initialization scheme that uses the directions capturing the most activation variance, provably maximizing the expected gradient signal and accelerating fine-tuning. EVA performs incremental SVD on minibatches of activation vectors and selects the right-singular vectors for initialization once they converged. Further, by selecting the directions that capture the most activation-variance for a given rank budget, EVA accommodates adaptive ranks that reduce the number of trainable parameters. We apply EVA to a variety of fine-tuning tasks as language generation and understanding, image classification, and reinforcement learning. EVA exhibits faster convergence than competitors and achieves the highest average score across a multitude of tasks per domain while reducing the number of trainable parameters through rank redistribution. In summary, EVA establishes a new Pareto frontier compared to existing LoRA initialization schemes in both accuracy and efficiency.
comment: Accepted at NeurIPS 2025, Shared first authorship, Code available at https://github.com/ml-jku/EVA
♻ ☆ FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization
The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss introduced. This framework offers a flexible and adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
comment: 1p pages, 7 figures, 2 tables
♻ ☆ The Syntax and Semantics of einsum
In 2011, einsum was introduced to NumPy as a practical and convenient notation for tensor expressions in machine learning, quantum circuit simulation, and other fields. It has since been implemented in additional Python frameworks such as PyTorch and TensorFlow, as well as in other programming languages such as Julia. Despite its practical success, the einsum notation still lacks a solid theoretical basis, and is not unified across the different frameworks, limiting opportunities for formal reasoning and systematic optimization. In this work, we discuss the terminology of tensor expressions and provide a formal definition of the einsum language. Based on this definition, we formalize and prove important equivalence rules for tensor expressions and highlight their relevance in practical applications.
comment: 21 pages, 1 figure. Includes formal definitions, proofs of algebraic properties, and nesting/denesting rules for the einsum notation
♻ ☆ Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby side stepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypasses safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.
♻ ☆ Adv-SSL: Adversarial Self-Supervised Representation Learning with Theoretical Guarantees NeurIPS 2025
Learning transferable data representations from abundant unlabeled data remains a central challenge in machine learning. Although numerous self-supervised learning methods have been proposed to address this challenge, a significant class of these approaches aligns the covariance or correlation matrix with the identity matrix. Despite impressive performance across various downstream tasks, these methods often suffer from biased sample risk, leading to substantial optimization shifts in mini-batch settings and complicating theoretical analysis. In this paper, we introduce a novel \underline{\bf Adv}ersarial \underline{\bf S}elf-\underline{\bf S}upervised Representation \underline{\bf L}earning (Adv-SSL) for unbiased transfer learning with no additional cost compared to its biased counterparts. Our approach not only outperforms the existing methods across multiple benchmark datasets but is also supported by comprehensive end-to-end theoretical guarantees. Our analysis reveals that the minimax optimization in Adv-SSL encourages representations to form well-separated clusters in the embedding space, provided there is sufficient upstream unlabeled data. As a result, our method achieves strong classification performance even with limited downstream labels, shedding new light on few-shot learning.
comment: Accepted at the Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning NeurIPS 2025
Sparse-reward reinforcement learning (RL) can model a wide range of highly complex tasks. Solving sparse-reward tasks is RL's core premise, requiring efficient exploration coupled with long-horizon credit assignment, and overcoming these challenges is key for building self-improving agents with superhuman ability. Prior work commonly explores with the objective of solving many sparse-reward tasks, making exploration of individual high-dimensional, long-horizon tasks intractable. We argue that solving such challenging tasks requires solving simpler tasks that are relevant to the target task, i.e., whose achieval will teach the agent skills required for solving the target task. We demonstrate that this sense of direction, necessary for effective exploration, can be extracted from existing RL algorithms, without leveraging any prior information. To this end, we propose a method for directed sparse-reward goal-conditioned very long-horizon RL (DISCOVER), which selects exploratory goals in the direction of the target task. We connect DISCOVER to principled exploration in bandits, formally bounding the time until the target task becomes achievable in terms of the agent's initial distance to the target, but independent of the volume of the space of all tasks. We then perform a thorough evaluation in high-dimensional environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL.
comment: NeurIPS 2025
♻ ☆ General agents contain world models ICML 2025
Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent's policy, and that increasing the agents performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.
comment: Accepted ICML 2025. Typos corrected
♻ ☆ What is Memory? A Homological Perspective
We introduce the delta-homology model of memory, a unified framework in which recall, learning, and prediction emerge from cycle closure, the completion of topologically constrained trajectories within the brain's latent manifold. A Dirac-like memory trace corresponds to a nontrivial homology generator, representing a sparse, irreducible attractor that reactivates only when inference trajectories close upon themselves. In this view, memory is not a static attractor landscape but a topological process of recurrence, where structure arises through the stabilization of closed loops. Building on this principle, we represent spike-timing dynamics as spatiotemporal complexes, in which temporally consistent transitions among neurons form chain complexes supporting persistent activation cycles. These cycles are organized into cell posets, compact causal representations that encode overlapping and compositional memory traces. Within this construction, learning and recall correspond to cycle closure under contextual modulation: inference trajectories stabilize into nontrivial homology classes when both local synchrony (context) and global recurrence (content) are satisfied. We formalize this mechanism through the Context-Content Uncertainty Principle (CCUP), which states that cognition minimizes joint uncertainty between a high-entropy context variable and a low-entropy content variable. Synchronization acts as a context filter selecting coherent subnetworks, while recurrence acts as a content filter validating nontrivial cycles.
♻ ☆ OCR-APT: Reconstructing APT Stories from Audit Logs using Subgraph Anomaly Detection and LLMs
Advanced Persistent Threats (APTs) are stealthy cyberattacks that often evade detection in system-level audit logs. Provenance graphs model these logs as connected entities and events, revealing relationships that are missed by linear log representations. Existing systems apply anomaly detection to these graphs but often suffer from high false positive rates and coarse-grained alerts. Their reliance on node attributes like file paths or IPs leads to spurious correlations, reducing detection robustness and reliability. To fully understand an attack's progression and impact, security analysts need systems that can generate accurate, human-like narratives of the entire attack. To address these challenges, we introduce OCR-APT, a system for APT detection and reconstruction of human-like attack stories. OCR-APT uses Graph Neural Networks (GNNs) for subgraph anomaly detection, learning behavior patterns around nodes rather than fragile attributes such as file paths or IPs. This approach leads to a more robust anomaly detection. It then iterates over detected subgraphs using Large Language Models (LLMs) to reconstruct multi-stage attack stories. Each stage is validated before proceeding, reducing hallucinations and ensuring an interpretable final report. Our evaluations on the DARPA TC3, OpTC, and NODLINK datasets show that OCR-APT outperforms state-of-the-art systems in both detection accuracy and alert interpretability. Moreover, OCR-APT reconstructs human-like reports that comprehensively capture the attack story.
comment: This is the authors' extended version of the paper accepted for publication at the ACM SIGSAC Conference on Computer and Communications Security (CCS 2025). The final published version is available at https://doi.org/10.1145/3719027.3765219
♻ ☆ CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation
Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories, including human-centered, metric-based, and LLM-based. Considering that human-centered approaches are labour-intensive and metric-based ones overly rely on reference answers, LLM-based approaches are gaining increasing attention due to their stronger contextual understanding capabilities. However, they generally evaluate the generated code based on static prompts, and tend to fail for complex code scenarios which typically involve multiple requirements and require more contextual information. In addition, these approaches lack fine-grained evaluation for complex code, resulting in limited explainability. To mitigate the limitations, we propose CodeVisionary, the first agent-based evaluation framework for complex code generation. CodeVisionary consists of two stages: (1) Requirement-guided multi-dimensional context distillation stage and (2) Fine-grained scoring and summarization stage. A comprehensive evaluation report is also generated for enhanced explainability. For validation, we construct a new benchmark consisting of 363 samples spanning 37 coding scenarios and 23 programming languages. Extensive experiments demonstrate that CodeVisionary achieves the best performance among three baselines for evaluating complex code generation, outperforming the best baseline with average improvements of 0.217, 0.163, and 0.141 in Pearson, Spearman, and Kendall-Tau coefficients, respectively. The resources of CodeVisionary are available at https://github.com/Eshe0922/CodeVisionary.
♻ ☆ LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models
Text-Attributed Graphs (TAGs), where each node is associated with text descriptions, are ubiquitous in real-world scenarios. They typically exhibit distinctive structure and domain-specific knowledge, motivating the development of a Graph Foundation Model (GFM) that generalizes across diverse graphs and tasks. Despite large efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures with two-stage alignment, limiting their synergistic potential. Even worse, existing methods assign out-of-vocabulary (OOV) tokens to graph nodes, leading to graph-specific semantics, token explosion, and incompatibility with task-oriented prompt templates, which hinders cross-graph and cross-task transferability. To address these challenges, we propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning. PromptGFM comprises two key components: (1) Graph Understanding Module, which explicitly prompts LLMs to replicate the finest GNN workflow within the text space, facilitating seamless GNN-LLM integration and elegant graph-text alignment; (2) Graph Inference Module, which establishes a language-based graph vocabulary ensuring expressiveness, transferability, and scalability, enabling readable instructions for LLM fine-tuning. Extensive experiments demonstrate our superiority and transferability across diverse graphs and tasks. The code is available at this: https://github.com/agiresearch/PromptGFM.
♻ ☆ Communications to Circulations: Real-Time 3D Wind Field Prediction Using 5G GNSS Signals and Deep Learning
Accurate atmospheric wind field information is crucial for various applications, including weather forecasting, aviation safety, and disaster risk reduction. However, obtaining high spatiotemporal resolution wind data remains challenging due to limitations in traditional in-situ observations and remote sensing techniques, as well as the computational expense and biases of numerical weather prediction (NWP) models. This paper introduces G-WindCast, a novel deep learning framework that leverages signal strength variations from 5G Global Navigation Satellite System (GNSS) signals to forecast three-dimensional (3D) atmospheric wind fields. The framework utilizes Forward Neural Networks (FNN) and Transformer networks to capture complex, nonlinear, and spatiotemporal relationships between GNSS-derived features and wind dynamics. Our preliminary results demonstrate promising accuracy in real-time wind forecasts (up to 30 minutes lead time). The model exhibits robustness across forecast horizons and different pressure levels, and its predictions for wind fields show superior agreement with ground-based radar wind profiler compared to concurrent European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5). Furthermore, we show that the system can maintain excellent performance for localized forecasting even with a significantly reduced number of GNSS stations (e.g., around 100), highlighting its cost-effectiveness and scalability. This interdisciplinary approach underscores the transformative potential of exploiting non-traditional data sources and deep learning for advanced environmental monitoring and real-time atmospheric applications.
comment: 31 pages, 10 figures; Minor text revisions; Updated the questions, some images in the article, the abstract, and the main text content
♻ ☆ CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment by value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, the feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method-Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critique by one pass only based on the correctness of the step itself, providing deterministic token-level credits to refine the tokens that were originally assigned identical rule-based rewards. To further enhance the accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones like Llama and Qwen models show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO can help the model to foster the learning of correct reasoning pathways leading to correct answers.
comment: Work in progress
♻ ☆ Application-oriented automatic hyperparameter optimization for spiking neural network prototyping
Hyperparameter optimization (HPO) is of paramount importance in the development of high-performance, specialized artificial intelligence (AI) models, ranging from well-established machine learning (ML) solutions to the deep learning (DL) domain and the field of spiking neural networks (SNNs). The latter introduce further complexity due to the neuronal computational units and their additional hyperparameters, whose inadequate setting can dramatically impact the final model performance. At the cost of possible reduced generalization capabilities, the most suitable strategy to fully disclose the power of SNNs is to adopt an application-oriented approach and perform extensive HPO experiments. To facilitate these operations, automatic pipelines are fundamental, and their configuration is crucial. In this document, the Neural Network Intelligence (NNI) toolkit is used as reference framework to present one such solution, with a use case example providing evidence of the corresponding results. In addition, a summary of published works employing the presented pipeline is reported as a potential source of insights into application-oriented HPO experiments for SNN prototyping.
♻ ☆ Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
comment: 16 pages, 12 figures
♻ ☆ Going with the Flow: Approximating Banzhaf Values via Graph Neural Networks
Computing the Banzhaf value in network flow games is fundamental for quantifying agent influence in multi-agent systems, with applications ranging from cybersecurity to infrastructure planning. However, exact computation is intractable for systems with more than $\sim20$ agents due to exponential complexity $\mathcal{O}(2^m)$. While Monte Carlo sampling methods provide statistical estimates, they suffer from high sample complexity and cannot transfer knowledge across different network configurations, making them impractical for large-scale or dynamic systems. We present a novel learning-based approach using Graph Neural Networks (GNNs) to approximate Banzhaf values in cardinal network flow games. By framing the problem as a graph-level prediction task, our method learns generalisable patterns of agent influence directly from network topology and control structure. We conduct a comprehensive empirical study comparing three state-of-the-art GNN architectures-Graph Attention Networks (GAT), Graph Isomorphism Networks with Edge features (GINE), and EdgeConv-on a large-scale synthetic dataset of 200,000 graphs per configuration, varying in size (20-100 nodes), agent count (5-20), and edge probability (0.5-1.0). Our results demonstrate that trained GNN models achieve high-fidelity Banzhaf value approximation with order-of-magnitude speedups compared to exact and sampling-based methods. Most significantly, we show strong zero-shot generalisation: models trained on graphs of a specific size and topology accurately predict Banzhaf values for entirely new networks with different structural properties, without requiring retraining. This work establishes GNNs as a practical tool for scalable cooperative game-theoretic analysis of complex networked systems.
comment: 21 pages, 8 figures, 11-page appendix
♻ ☆ A deep solver for backward stochastic Volterra integral equations
We present the first deep-learning solver for backward stochastic Volterra integral equations (BSVIEs) and their fully-coupled forward-backward variants. The method trains a neural network to approximate the two solution fields in a single stage, avoiding the use of nested time-stepping cycles that limit classical algorithms. For the decoupled case we prove a non-asymptotic error bound composed of an a posteriori residual plus the familiar square root dependence on the time step. Numerical experiments are consistent with this rate and reveal two key properties: \emph{scalability}, in the sense that accuracy remains stable from low dimension up to 500 spatial variables while GPU batching keeps wall-clock time nearly constant; and \emph{generality}, since the same method handles coupled systems whose forward dynamics depend on the backward solution. These results open practical access to a family of high-dimensional, time-inconsistent problems in stochastic control and quantitative finance.
comment: 25 pages, 10 figures
♻ ☆ Neural Green's Operators for Parametric Partial Differential Equations
This work introduces a paradigm for constructing parametric neural operators that are derived from finite-dimensional representations of Green's operators, with learnable Green's functions, for linear partial differential equations (PDEs). We refer to such neural operators as Neural Green's Operators (NGOs). Our construction of NGOs preserves the linear action of Green's operators on the inhomogeneity fields, while approximating the nonlinear dependence of the Green's function on the coefficients of the PDE using neural networks that take weighted averages of such coefficients as input. This construction reduces the complexity of the problem from learning the entire solution operator and its dependence on all parameters to only learning the Green's function and its dependence on the PDE coefficients. Moreover, taking weighted averages, rather than point samples, of input functions decouples the network size from the number of sampling points, enabling efficient resolution of multiple scales in the input fields. Furthermore, we show that our explicit representation of Green's functions enables the embedding of desirable mathematical attributes in our NGO architectures, such as symmetry, spectral, and conservation properties. Through numerical benchmarks on canonical PDEs, we demonstrate that NGOs achieve comparable or superior accuracy to deep operator networks, variationally mimetic operator networks, and Fourier neural operators with similar parameter counts, while generalizing significantly better when tested on out-of-distribution data. For time-dependent PDEs, we show that NGOs can produce pointwise-accurate dynamics in an auto-regressive manner when trained on a single time step. Finally, we show that we can leverage the explicit representation of Green's functions returned by NGOs to construct effective matrix preconditioners that accelerate iterative solvers for PDEs.
♻ ☆ Conformal online model aggregation
Conformal prediction equips machine learning models with a reasonable notion of uncertainty quantification without making strong distributional assumptions. It wraps around any prediction model and converts point predictions into set predictions with a predefined marginal coverage guarantee. However, conformal prediction only works if we fix the underlying machine learning model in advance. A relatively unaddressed issue in conformal prediction is that of model selection and/or aggregation: given a set of prediction models, which one should we conformalize? This paper suggests that instead of performing model selection, it can be prudent and practical to perform conformal set aggregation in an online, adaptive fashion. We propose a wrapper that takes in several conformal prediction sets (themselves wrapped around black-box prediction models), and outputs a single adaptively-combined prediction set. Our method, called conformal online model aggregation (COMA), is based on combining the prediction sets from several algorithms by weighted voting, and can be thought of as a sort of online stacking of the underlying conformal sets. As long as the input sets have (distribution-free) coverage guarantees, COMA retains coverage guarantees, under a negative correlation assumption between errors and weights. We verify that the assumption holds empirically in all settings considered. COMA is well-suited for decentralized or distributed settings, where different users may have different models, and are only willing to share their prediction sets for a new test point in a black-box fashion. As we demonstrate, it is also well-suited to settings with distribution drift and shift, where model selection can be imprudent.
♻ ☆ Progressive Tempering Sampler with Diffusion ICML 2025
Recent research has focused on designing neural samplers that amortize the process of sampling from unnormalized densities. However, despite significant advancements, they still fall short of the state-of-the-art MCMC approach, Parallel Tempering (PT), when it comes to the efficiency of target evaluations. On the other hand, unlike a well-trained neural sampler, PT yields only dependent samples and needs to be rerun -- at considerable computational cost -- whenever new samples are required. To address these weaknesses, we propose the Progressive Tempering Sampler with Diffusion (PTSD), which trains diffusion models sequentially across temperatures, leveraging the advantages of PT to improve the training of neural samplers. We also introduce a novel method to combine high-temperature diffusion models to generate approximate lower-temperature samples, which are minimally refined using MCMC and used to train the next diffusion model. PTSD enables efficient reuse of sample information across temperature levels while generating well-mixed, uncorrelated samples. Our method significantly improves target evaluation efficiency, outperforming diffusion-based neural samplers.
comment: Accepted for publication at ICML 2025
♻ ☆ CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention
Electroencephalograph (EEG) is a crucial tool for studying brain activity. Recently, self-supervised learning methods leveraging large unlabeled datasets have emerged as a potential solution to the scarcity of widely available annotated EEG data. However, current methods suffer from at least one of the following limitations: i) sub-optimal EEG signal modeling, ii) model sizes in the hundreds of millions of trainable parameters, and iii) reliance on private datasets and/or inconsistent public benchmarks, hindering reproducibility. To address these challenges, we introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO), a new small EEG foundation model. Our tokenization scheme represents EEG signals at a per-channel patch granularity. We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving 2x speed improvement with 6x less memory required compared to standard self-attention. We present several model sizes ranging from 3.6 million to 85 million parameters. Pre-trained on over 20,000 hours of publicly available scalp EEG recordings with diverse channel configurations, our models set new benchmarks in emotion detection and seizure detection tasks, with competitive performance in anomaly classification and gait prediction. This validates our models' effectiveness and efficiency.
♻ ☆ Robust LLM Training Infrastructure at ByteDance
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
♻ ☆ The quest for the GRAph Level autoEncoder (GRALE)
Although graph-based learning has attracted a lot of attention, graph representation learning is still a challenging task whose resolution may impact key application fields such as chemistry or biology. To this end, we introduce GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained using an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module, which is trained jointly with the encoder and decoder. The proposed attention-based architecture relies on Evoformer, the core component of AlphaFold, which we extend to support both graph encoding and decoding. We show, in numerical experiments on simulated and molecular data, that GRALE enables a highly general form of pre-training, applicable to a wide range of downstream tasks, from classification and regression to more complex tasks such as graph interpolation, editing, matching, and prediction.
♻ ☆ Client Clustering Meets Knowledge Sharing: Enhancing Privacy and Robustness in Personalized Peer-to-Peer Learning
The growing adoption of Artificial Intelligence (AI) in Internet of Things (IoT) ecosystems has intensified the need for personalized learning methods that can operate efficiently and privately across heterogeneous, resource-constrained devices. However, enabling effective personalized learning in decentralized settings introduces several challenges, including efficient knowledge transfer between clients, protection of data privacy, and resilience against poisoning attacks. In this paper, we address these challenges by developing P4 (Personalized, Private, Peer-to-Peer) -- a method designed to deliver personalized models for resource-constrained IoT devices while ensuring differential privacy and robustness against poisoning attacks. Our solution employs a lightweight, fully decentralized algorithm to privately detect client similarity and form collaborative groups. Within each group, clients leverage differentially private knowledge distillation to co-train their models, maintaining high accuracy while ensuring robustness to the presence of malicious clients. We evaluate P4 on popular benchmark datasets using both linear and CNN-based architectures across various heterogeneity settings and attack scenarios. Experimental results show that P4 achieves 5% to 30% higher accuracy than leading differentially private peer-to-peer approaches and maintains robustness with up to 30% malicious clients. Additionally, we demonstrate its practicality by deploying it on resource-constrained devices, where collaborative training between two clients adds only ~7 seconds of overhead.
comment: This paper has been accepted for publication at the IEEE Annual Congress on Artificial Intelligence of Things (IEEE AIoT) 2025
♻ ☆ Federated Conditional Conformal Prediction via Generative Models
Conformal Prediction (CP) provides distribution-free uncertainty quantification by constructing prediction sets that guarantee coverage of the true labels. This reliability makes CP valuable for high-stakes federated learning scenarios such as multi-center healthcare. However, standard CP assumes i.i.d. data, which is violated in federated settings where client distributions differ substantially. Existing federated CP methods address this by maintaining marginal coverage on each client, but such guarantees often fail to reflect input-conditional uncertainty. In this work, we propose Federated Conditional Conformal Prediction (Fed-CCP) via generative models, which aims for conditional coverage that adapts to local data heterogeneity. Fed-CCP leverages generative models, such as normalizing flows or diffusion models, to approximate conditional data distributions without requiring the sharing of raw data. This enables each client to locally calibrate conformal scores that reflect its unique uncertainty, while preserving global consistency through federated aggregation. Experiments on real datasets demonstrate that Fed-CCP achieves more adaptive prediction sets.
♻ ☆ Trainable Dynamic Mask Sparse Attention
The increasing demand for long-context modeling in large language models (LLMs) is bottlenecked by the quadratic complexity of the standard self-attention mechanism. The community has proposed sparse attention to mitigate this issue. However, position-aware sparse attention methods rely on static sparse structures that lack adaptability to diverse query contexts, while content-aware sparse attention methods depend on heuristic key-value selection, hindering full differentiability. We introduce a trainable dynamic mask sparse attention mechanism, a method that merges the advantages of both position-aware and content-aware approaches. Dynamic Mask Attention (DMA) achieves this through three key innovations: First, it leverages value vector representations to generate content-aware dynamic masks, enabling the model to adaptively identify and attend to critical information. Second, it computes position-aware sparse weights in a hardware-friendly manner, efficiently skipping unnecessary computational regions. Finally, we demonstrate that the introduced dynamic mask and sparse weights do not obstruct gradients, supporting end-to-end training. We have validated the performance of DMA through comprehensive experiments. A large body of experimental evidence shows that DMA consistently holds a Pareto advantage over state-of-the-art sparse attention baselines in tasks including scaling laws, multi-query associative recall, standard benchmarks, and needle in a haystack tests, while also delivering up to a 10x overall speedup. These results highlight its ability to effectively balance model efficiency with long-context modeling capabilities. Our computational kernel code is now open-source at https://github.com/SmallDoges/flash-dmattn to encourage further research and application by the community.
comment: 26 pages
♻ ☆ Path Gradients after Flow Matching
Boltzmann Generators have emerged as a promising machine learning tool for generating samples from equilibrium distributions of molecular systems using Normalizing Flows and importance weighting. Recently, Flow Matching has helped speed up Continuous Normalizing Flows (CNFs), scale them to more complex molecular systems, and minimize the length of the flow integration trajectories. We investigate the benefits of using path gradients to fine-tune CNFs initially trained by Flow Matching, in the setting where a target energy is known. Our experiments show that this hybrid approach yields up to a threefold increase in sampling efficiency for molecular systems, all while using the same model, a similar computational budget and without the need for additional sampling. Furthermore, by measuring the length of the flow trajectories during fine-tuning, we show that path gradients largely preserve the learned structure of the flow.
♻ ☆ Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling NeurIPS2025
Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs) by using additional inference-time computation to explore multiple reasoning paths through search. Yet how to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time. To bridge this gap, we formulate test-time search as a resource allocation problem and derive the optimal allocation strategy that maximizes the probability of obtaining a correct solution under a fixed rollout budget. Within this formulation, we reveal a core limitation of existing search methods: solution-level allocation tends to favor reasoning directions with more candidates, leading to theoretically suboptimal and inefficient use of compute. To address this, we propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this bias by decoupling direction quality from candidate count and allocating resources at the direction level. To demonstrate DORA's effectiveness, we conduct extensive experiments on challenging mathematical reasoning benchmarks including MATH500, AIME2024, and AIME2025. The empirical results show that DORA consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art accuracy. We hope our findings contribute to a broader understanding of optimal TTS for LLMs.
comment: Accepted at NeurIPS2025
♻ ☆ Multiscale Neural PDE Surrogates for Prediction and Downscaling: Application to Ocean Currents NeurIPS 2025
Accurate modeling of physical systems governed by partial differential equations is a central challenge in scientific computing. In oceanography, high-resolution current data are critical for coastal management, environmental monitoring, and maritime safety. However, available satellite products, such as Copernicus data for sea water velocity at ~0.08 degrees spatial resolution and global ocean models, often lack the spatial granularity required for detailed local analyses. In this work, we (a) introduce a supervised deep learning framework based on neural operators for solving PDEs and providing arbitrary resolution solutions, and (b) propose downscaling models with an application to Copernicus ocean current data. Additionally, our method can model surrogate PDEs and predict solutions at arbitrary resolution, regardless of the input resolution. We evaluated our model on real-world Copernicus ocean current data and synthetic Navier-Stokes simulation datasets.
comment: Tackling Climate Change with Machine Learning: workshop at NeurIPS 2025
♻ ☆ Towards Explainable Deep Clustering for Time Series Data ECML
Deep clustering uncovers hidden patterns and groups in complex time series data, yet its opaque decision-making limits use in safety-critical settings. This survey offers a structured overview of explainable deep clustering for time series, collecting current methods and their real-world applications. We thoroughly discuss and compare peer-reviewed and preprint papers through application domains across healthcare, finance, IoT, and climate science. Our analysis reveals that most work relies on autoencoder and attention architectures, with limited support for streaming, irregularly sampled, or privacy-preserved series, and interpretability is still primarily treated as an add-on. To push the field forward, we outline six research opportunities: (1) combining complex networks with built-in interpretability; (2) setting up clear, faithfulness-focused evaluation metrics for unsupervised explanations; (3) building explainers that adapt to live data streams; (4) crafting explanations tailored to specific domains; (5) adding human-in-the-loop methods that refine clusters and explanations together; and (6) improving our understanding of how time series clustering models work internally. By making interpretability a primary design goal rather than an afterthought, we propose the groundwork for the next generation of trustworthy deep clustering time series analytics.
comment: 14 pages, accepted at TempXAI Workshop at ECML-PKDD 2025
♻ ☆ Synthetic Series-Symbol Data Generation for Time Series Foundation Models NeurIPS 2025
Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tunes with downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
comment: 64 pages, 25 figures, 35 tables, NeurIPS 2025 accepted
♻ ☆ RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation
Interactive Theorem Proving was repeatedly shown to be fruitful combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We highlight the importance of thorough premise selection for generating Rocq proofs and propose a novel approach, leveraging retrieval via a self-attentive embedder model. The evaluation of the designed approach shows up to 28% relative increase of the generator's performance. We tackle the problem of writing Rocq proofs using a multi-stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and demonstrate shows that incorporating multi-agent debate during the planning stage increases the proof success rate by 20% overall and nearly doubles it for complex theorems, while the reflection mechanism further enhances stability and consistency.
♻ ☆ Reliable Wireless Indoor Localization via Cross-Validated Prediction-Powered Calibration
Wireless indoor localization using predictive models with received signal strength information (RSSI) requires proper calibration for reliable position estimates. One remedy is to employ synthetic labels produced by a (generally different) predictive model. But fine-tuning an additional predictor, as well as estimating residual bias of the synthetic labels, demands additional data, aggravating calibration data scarcity in wireless environments. This letter proposes an approach that efficiently uses limited calibration data to simultaneously fine-tune a predictor and estimate the bias of synthetic labels, yielding prediction sets with rigorous coverage guarantees. Experiments on a fingerprinting dataset validate the effectiveness of the proposed method.
♻ ☆ Neural Dynamic Data Valuation: A Stochastic Optimal Control Approach
Data valuation has become a cornerstone of the modern data economy, where datasets function as tradable intellectual assets that drive decision-making, model training, and market transactions. Despite substantial progress, existing valuation methods remain limited by high computational cost, weak fairness guarantees, and poor interpretability, which hinder their deployment in large-scale, high-stakes applications. This paper introduces Neural Dynamic Data Valuation (NDDV), a new framework that formulates data valuation as a stochastic optimal control problem to capture the dynamic evolution of data utility over time. Unlike static combinatorial approaches, NDDV models data interactions through continuous trajectories that reflect both individual and collective learning dynamics.
comment: 14 pages, 10 figures
♻ ☆ ReDi: Rectified Discrete Flow
Discrete Flow-based Models (DFMs) are powerful generative models for high-quality discrete data but typically suffer from slow sampling speeds due to their reliance on iterative decoding processes. This reliance on a multi-step process originates from the factorization approximation of DFMs, which is necessary for handling high-dimensional data. In this paper, we analyze the factorization approximation error using Conditional Total Correlation (TC), and reveal its dependence on the coupling. To address the challenge of efficient few-step generation, we propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces the underlying factorization error (measured as Conditional TC) by rectifying the coupling between source and target distributions. We theoretically prove that each ReDi step guarantees a monotonic decreasing Conditional TC, ensuring its convergence. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation. Moreover, we demonstrate that the rectified couplings are well-suited for training efficient one-step models on image generation. ReDi offers a simple and theoretically grounded approach for tackling the few-step challenge, providing a new perspective on efficient discrete data synthesis. Code is available at https://github.com/Ugness/ReDi_discrete.
♻ ☆ LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation ICCV 2025
Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Open Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing. Project Website: https://langtraj.github.io/
comment: ICCV 2025
♻ ☆ Hyperspectral Anomaly Detection Fused Unified Nonconvex Tensor Ring Factors Regularization
In recent years, tensor decomposition-based approaches for hyperspectral anomaly detection (HAD) have gained significant attention in the field of remote sensing. However, existing methods often fail to fully leverage both the global correlations and local smoothness of the background components in hyperspectral images (HSIs), which exist in both the spectral and spatial domains. This limitation results in suboptimal detection performance. To mitigate this critical issue, we put forward a novel HAD method named HAD-EUNTRFR, which incorporates an enhanced unified nonconvex tensor ring (TR) factors regularization. In the HAD-EUNTRFR framework, the raw HSIs are first decomposed into background and anomaly components. The TR decomposition is then employed to capture the spatial-spectral correlations within the background component. Additionally, we introduce a unified and efficient nonconvex regularizer, induced by tensor singular value decomposition (TSVD), to simultaneously encode the low-rankness and sparsity of the 3-D gradient TR factors into a unique concise form. The above characterization scheme enables the interpretable gradient TR factors to inherit the low-rankness and smoothness of the original background. To further enhance anomaly detection, we design a generalized nonconvex regularization term to exploit the group sparsity of the anomaly component. To solve the resulting doubly nonconvex model, we develop a highly efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) framework. Experimental results on several benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art (SOTA) approaches in terms of detection accuracy.
♻ ☆ When majority rules, minority loses: bias amplification of gradient descent
Despite growing empirical evidence of bias amplification in machine learning, its theoretical foundations remain poorly understood. We develop a formal framework for majority-minority learning tasks, showing how standard training can favor majority groups and produce stereotypical predictors that neglect minority-specific features. Assuming population and variance imbalance, our analysis reveals three key findings: (i) the close proximity between ``full-data'' and stereotypical predictors, (ii) the dominance of a region where training the entire model tends to merely learn the majority traits, and (iii) a lower bound on the additional training required. Our results are illustrated through experiments in deep learning for tabular and image classification tasks.
♻ ☆ Language Models are Injective and Hence Invertible
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
♻ ☆ MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature
Synthesis procedures play a critical role in materials research, as they directly affect material properties. With data-driven approaches increasingly accelerating materials discovery, there is growing interest in extracting synthesis procedures from scientific literature as structured data. However, existing studies often rely on rigid, domain-specific schemas with predefined fields for structuring synthesis procedures or assume that synthesis procedures are linear sequences of operations, which limits their ability to capture the structural complexity of real-world procedures. To address these limitations, we adopt PROV-DM, an international standard for provenance information, which supports flexible, graph-based modeling of procedures. We present MatPROV, a dataset of PROV-DM-compliant synthesis procedures extracted from scientific literature using large language models. MatPROV captures structural complexities and causal relationships among materials, operations, and conditions through visually intuitive directed graphs. This representation enables machine-interpretable synthesis knowledge, opening opportunities for future research such as automated synthesis planning and optimization.
♻ ☆ DeepSeek-Inspired Exploration of RL-based LLMs and Synergy with Wireless Networks: A Survey
Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have attracted widespread attention for their remarkable capabilities in multimodal data understanding. Meanwhile, the rapid expansion of information services has led to a growing demand for AI-enabled wireless networks. The open-source DeepSeek models are famous for their innovative designs, such as large-scale pure RL and cost-efficient training, which make them well-suited for practical deployment in wireless networks. By integrating DeepSeek-style LLMs with wireless infrastructures, a synergistic opportunity arises: the DeepSeek-style LLMs enhance network optimization with strong reasoning and decision-making abilities, while wireless infrastructure enables the broad deployment of these models. Motivated by this convergence, this survey presents a comprehensive DeepSeek-inspired exploration of RL-based LLMs in the context of wireless networks. We begin by reviewing key techniques behind network optimization to establish a foundation for understanding DeepSeek-style LLM integration. Next, we examine recent advancements in RL-based LLMs, using DeepSeek models as a representative example. Building on this, we explore the synergy between the two domains, highlighting motivations, challenges, and potential solutions. Finally, we highlight emerging directions for integrating LLMs with wireless networks, such as quantum, on-device, and neural-symbolic LLM models, as well as embodied AI agents. Overall, this survey offers a comprehensive examination of the interplay between DeepSeek-style LLMs and wireless networks, demonstrating how these domains can mutually enhance each other to drive innovation.
comment: 45 pages, 12 figures
♻ ☆ Error Broadcast and Decorrelation as a Potential Artificial and Natural Learning Mechanism
We introduce Error Broadcast and Decorrelation (EBD), a novel learning framework for neural networks that addresses credit assignment by directly broadcasting output errors to individual layers, circumventing weight transport of backpropagation. EBD is rigorously grounded in the stochastic orthogonality property of Minimum Mean Square Error estimators. This fundamental principle states that the error of an optimal estimator is orthogonal to functions of the input. Guided by this insight, EBD defines layerwise loss functions that directly penalize correlations between layer activations and output errors, thereby establishing a principled foundation for error broadcasting. This theoretically sound mechanism naturally leads to the experimentally observed three-factor learning rule and integrates with biologically plausible frameworks to enhance performance and plausibility. Numerical experiments demonstrate EBD's competitive or better performance against other error-broadcast methods on benchmark datasets. Our findings establish EBD as an efficient, biologically plausible, and principled alternative for neural network training. The implementation is available at: https://github.com/meterdogan07/error-broadcast-decorrelation.
♻ ☆ UniCrossFi: A Unified Framework For Cross-Domain Wi-Fi-based Gesture Recognition
Wi-Fi sensing systems are severely hindered by cross domain problem when deployed in unseen real-world environments. Existing methods typically design separate frameworks for either domain adaptation or domain generalization, often relying on extensive labeled data. Existing methods that designed for domain generalization is often relying on extensive labeled data. However, real-world scenarios are far more complex, where the deployed model must be capable of handling generalization under limited labeled source data. To this end, we propose UniCrossFi, a unified framework designed to mitigate performance drop in CSI-based sensing across diverse deployment settings. Our framework not only extends conventional Domain Generalization (DG) to a more practical Semi-Supervised Domain Generalization (SSDG) setting, where only partially labeled source data are available, but also introduces a physics-informed data augmentation strategy, Antenna Response Consistency (ARC). ARC mitigates the risk of learning superficial shortcuts by exploiting the intrinsic spatial diversity of multi-antenna systems, treating signals from different antennas as naturally augmented views of the same event. In addition, we design a Unified Contrastive Objective to prevent conventional contrastive learning from pushing apart samples from different domains that share the same class. We conduct extensive experiments on the public Widar and CSIDA datasets. The results demonstrate that UniCrossFi consistently establishes a new state-of-the-art, significantly outperforming existing methods across all unsupervised domain adaptation, DG, and SSDG benchmarks. UniCrossFi provides a principled and practical solution to the domain shift challenge, advancing the feasibility of robust, real-world Wi-Fi sensing systems that can operate effectively with limited labeled data.
♻ ☆ Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics
Time Series Analysis is widely used in various real-world applications such as weather forecasting, financial fraud detection, imputation for missing data in IoT systems, and classification for action recognization. Mixture-of-Experts (MoE), as a powerful architecture, though demonstrating effectiveness in NLP, still falls short in adapting to versatile tasks in time series analytics due to its task-agnostic router and the lack of capability in modeling channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE to support the intricate ``knowledge'' utilization for distinct tasks, thus task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose a Recurrent Noisy Gating to utilize the hierarchical information in routing, thus obtaining task-sepcific capability. And the routing strategy is operated on time series tokens in both temporal and channel dimensions, and encouraged by a meticulously designed Temporal \& Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.
♻ ☆ From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement -- behaviors once regarded as uniquely human. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics. We unify three previously fragmented perspectives -- process-oriented, autonomy-oriented, and mechanism-oriented -- through a comprehensive framework that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across the above domains, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research.
♻ ☆ Aurora: Towards Universal Generative Multimodal Time Series Forecasting
Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to the domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like texts, the former lacks the explicit utilization of them, thus hindering the performance. The latter is tailored for end-to-end scenarios and does not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on Corss-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corrsponding text or image modalities, thus possessing strong Cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilizes a Modality-Guided Multi-head Self-Attention to inject them into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora on both unimodal and multimodal scenarios.
♻ ☆ Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective
Time Series Forecasting has made significant progress with the help of Patching technique, which partitions time series into multiple patches to effectively retain contextual semantic information into a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which causes a fixed representation space, thus resulting in insufficiently expressful representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes the learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming at fully exploiting the information of contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plugin-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at https://github.com/decisionintelligence/SRSNet.
♻ ☆ Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning
Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM's embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM --> ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.
comment: Under Submission
♻ ☆ Re-uploading quantum data: A universal function approximator for quantum inputs
Quantum data re-uploading has proved powerful for classical inputs, where repeatedly encoding features into a small circuit yields universal function approximation. Extending this idea to quantum inputs remains underexplored, as the information contained in a quantum state is not directly accessible in classical form. We propose and analyze a quantum data re-uploading architecture in which a qubit interacts sequentially with fresh copies of an arbitrary input state. The circuit can approximate any bounded continuous function using only one ancilla qubit and single-qubit measurements. By alternating entangling unitaries with mid-circuit resets of the input register, the architecture realizes a discrete cascade of completely positive and trace-preserving maps, analogous to collision models in open quantum system dynamics. Our framework provides a qubit-efficient and expressive approach to designing quantum machine learning models that operate directly on quantum data.
comment: 24 pages, 11 figures
♻ ☆ Membership Inference Attack Should Move On to Distributional Statistics for Distilled Generative Models
To detect unauthorized data usage in training large-scale generative models (e.g., ChatGPT or Midjourney), membership inference attacks (MIA) have proven effective in distinguishing a single training instance (a member) from a single non-training instance (a non-member). This success is mainly credited to a memorization effect: models tend to perform better on a member than a non-member. However, we find that standard MIAs fail against distilled generative models (i.e., student models) that are increasingly deployed in practice for efficiency (e.g., ChatGPT 4o-mini). Trained exclusively on data generated from a large-scale model (a teacher model), the student model lacks direct exposure to any members (teacher's training data), nullifying the memorization effect that standard MIAs rely on. This finding reveals a serious privacy loophole, where generation-service providers could deploy a student model whose teacher was potentially trained on unauthorized data, yet claim the deployed model is clean because it was not directly trained on such data. Hence, are distilled models inherently unauditable for upstream privacy violations, and should we discard them when we care about privacy? We contend no, as we uncover a memory chain connecting the student and teacher's member data: the distribution of student-generated data aligns more closely with the distribution of the teacher's members than with non-members, thus we can detect unauthorized data usage even when direct instance-level memorization is absent. This leads us to posit that MIAs on distilled generative models should shift from instance-level scores to distribution-level statistics. We further propose three principles of distribution-based MIAs for detecting unauthorized training data through distilled generative models, and validate our position through an exemplar framework. We lastly discuss the implications of our position.
♻ ☆ Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities NeurIPS 2025
LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question in context of mathematical inequalities -- specifically the prover's ability to recognize that the given problem simplifies by applying a known inequality such as AM/GM. Specifically, we are interested in their ability to do this in a compositional setting where multiple inequalities must be applied as part of a solution. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers -- including Goedel, STP, and Kimina-7B -- struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness, but still suffers a 20% performance drop (pass@32). Even for DeepSeek-Prover-V2-671B model, the gap between compositional variants and seed problems exists, implying that simply scaling up the model size alone does not fully solve the compositional weakness. Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition. All data and evaluation code can be found at https://github.com/haoyuzhao123/LeanIneqComp.
comment: To appear in NeurIPS 2025 Track on Datasets and Benchmarks. 28 pages
♻ ☆ When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement Approach
Text-attributed graphs (TAGs) have become a key form of graph-structured data in modern data management and analytics, combining structural relationships with rich textual semantics for diverse applications. However, the effectiveness of analytical models, particularly graph neural networks (GNNs), is highly sensitive to data quality. Our empirical analysis shows that both conventional and LLM-enhanced GNNs degrade notably under textual, structural, and label imperfections, underscoring TAG quality as a key bottleneck for reliable analytics. Existing studies have explored data-level optimization for TAGs, but most focus on specific degradation types and target a single aspect like structure or label, lacking a systematic and comprehensive perspective on data quality improvement. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework for comprehensive TAG quality optimization. LAGA formulates graph quality control as a data-centric process, integrating detection, planning, action, and evaluation agents into an automated loop. It holistically enhances textual, structural, and label aspects through coordinated multi-modal optimization. Extensive experiments on 5 datasets and 16 baselines across 9 scenarios demonstrate the effectiveness, robustness and scalability of LAGA, confirming the importance of data-centric quality optimization for reliable TAG analytics.
comment: 12 pages, 7figures
♻ ☆ Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction
As large-scale pre-trained foundation models continue to expand in size and capability, efficiently adapting them to specific downstream tasks has become increasingly critical. Despite substantial progress, existing adaptation approaches have evolved largely in isolation, without a clear understanding of their interrelationships. This survey introduces neural network reprogrammability as a unifying framework that bridges mainstream model adaptation techniques--model reprogramming, prompt tuning, and prompt instruction--previously fragmented research areas yet converges on a shared principle: repurposing a pre-trained model by manipulating information at the interfaces while keeping the model parameters frozen. These methods exploit neural networks' sensitivity to manipulation on different interfaces, be it through perturbing inputs, inserting tokens into intermediate layers, or providing task-specific examples in context, to redirect model behaviors towards desired outcomes. We then present a taxonomy that categorizes such information manipulation-based adaptation approaches across four key dimensions: manipulation format (fixed or learnable), location (interfaces where manipulations occur), operator (how they are applied), and output alignment requirement (post-processing needed to align outputs with downstream tasks). Notably, this framework applies consistently across data modalities, independent of specific model architectures. Moreover, viewing established techniques like in-context learning and chain-of-thought prompting through this lens reveals both their theoretical connections and practical distinctions. We further analyze remaining technical challenges and ethical considerations, positioning neural network reprogrammability as a fundamental paradigm for efficient model adaptation. We lastly identify promising research directions emerging from this integrative viewpoint.
♻ ☆ On Task Vectors and Gradients
Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
comment: 10 pages of main paper, 5 figures
♻ ☆ Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems
Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor intensive. We present an automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations on two real-world student code submission datasets in different programming languages.We find that KCGen-KT outperforms existing KT methods and human-written KCs on future student response prediction. We investigate the learning curves of generated KCs and show that LLM-generated KCs result in a better fit than human written KCs under a cognitive model. We also conduct a human evaluation with course instructors to show that our pipeline generates reasonably accurate problem-KC mappings.
♻ ☆ BLUR: A Bi-Level Optimization Approach for LLM Unlearning
Enabling large language models (LLMs) to unlearn knowledge and capabilities acquired during training has proven vital for ensuring compliance with data regulations and promoting ethical practices in generative AI. Although there are growing interests in developing various unlearning algorithms, it remains unclear how to best formulate the unlearning problem. The most popular formulation uses a weighted sum of forget and retain loss, but it often leads to performance degradation due to the inherent trade-off between forget and retain losses. In this work, we argue that it is important to model the hierarchical structure of the unlearning problem, where the forget problem (which \textit{unlearns} certain knowledge and/or capabilities) takes priority over the retain problem (which preserves model utility). This hierarchical structure naturally leads to a bi-level optimization formulation where the lower-level objective focuses on minimizing the forget loss, while the upper-level objective aims to maintain the model's utility. Based on this new formulation, we propose a novel algorithm, termed Bi-Level UnleaRning (\texttt{BLUR}), which not only possesses strong theoretical guarantees but more importantly, delivers superior performance. In particular, our extensive experiments demonstrate that \texttt{BLUR} consistently outperforms all the state-of-the-art algorithms across various unlearning tasks, models, and metrics. Codes are available at https://github.com/OptimAI-Lab/BLURLLMUnlearning.
♻ ☆ Machine Learning for Early Detection of Meningitis: Stacked Ensemble Learning with EHR Data
We utilized a cohort of 214 meningitis patients and 46,303 non-meningitis patients from the MIMIC-III database. After extensive data preprocessing, which included ICD-based cohort selection, one-hot encoding of coding, and a two-stage feature selection process (for both the training set and the testing sets), clinically relevant features such as gender and high-risk ICD codes (including subarachnoid hemorrhage, secondary malignant neoplasm of the brain, and generalized epilepsy) are selected. Overall, these clinically reasonable and temporally adherent features provided excellent modeling performance. Three models (Random Forest, LightGBM, and Deep Neural Networks (DNN) are trained as base models for Ensemble Learning. Base model outputs are aggregated and stacked into a meta model (Logistic Regression) that uses the base model outputs as input values in training. Ultimately, soldier outputs (AUC of Testing Set 1: 0.9637, AUC of Testing Set 2: 0.9472) are obtained through ensemble learning. We created a challenging condition for diagnosing meningitis, simulating a real-world ER (Emergency Room) scenario to enhance clinical use in real-world applications. While directly deploying a diagnostic tool that clinicians can use is challenging, this paper paves the way for a potential future AI-driven diagnostic approach for meningitis using Ensemble Learning.
♻ ☆ Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees
Distributed optimization is pivotal for large-scale signal processing and machine learning, yet communication overhead remains a major bottleneck. Low-rank gradient compression, in which the transmitted gradients are approximated by low-rank matrices to reduce communication, offers a promising remedy. Existing methods typically adopt either randomized or greedy compression strategies: randomized approaches project gradients onto randomly chosen subspaces, introducing high variance and degrading empirical performance; greedy methods select the most informative subspaces, achieving strong empirical results but lacking convergence guarantees. To address this gap, we propose GreedyLore--the first Greedy Low-Rank gradient compression algorithm for distributed learning with rigorous convergence guarantees. GreedyLore incorporates error feedback to correct the bias introduced by greedy compression and introduces a semi-lazy subspace update that ensures the compression operator remains contractive throughout all iterations. With these techniques, we prove that GreedyLore achieves a convergence rate of $\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD and Adam--marking the first linear speedup convergence rate for low-rank gradient compression. Extensive experiments are conducted to validate our theoretical findings.
comment: 17 pages, 5 figures
♻ ☆ Importance-Aware Activation Space Reconstruction
Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure-prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and uniform reconstruction can harm performance. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links model compression decisions to their impact on model behavior. IMPACT formulates an optimization problem that considers both activation structure and gradient sensitivity, and derives a closed-form solution where the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This enables low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.
♻ ☆ RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility NeurIPS
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors. To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reasoners. Methodologically, RHYTHM employs temporal tokenization to partition each trajectory into daily segments and encode them as discrete tokens with hierarchical attention that captures both daily and weekly dependencies, thereby quadratically reducing the sequence length while preserving cyclical information. Additionally, we enrich token representations by adding pre-computed prompt embeddings for trajectory segments and prediction targets via a frozen LLM, and feeding these combined embeddings back into the LLM backbone to capture complex interdependencies. Computationally, RHYTHM keeps the pretrained LLM backbone frozen, yielding faster training and lower memory usage. We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time. Code is publicly available at https://github.com/he-h/rhythm.
comment: Advances in Neural Information Processing Systems 39 (NeurIPS) 2025
♻ ☆ GeoRecon: Graph-Level Representation Learning for 3D Molecules via Reconstruction-Based Pretraining
The pretraining-finetuning paradigm has powered major advances in domains such as natural language processing and computer vision, with representative examples including masked language modeling and next-token prediction. In molecular representation learning, however, pretraining tasks remain largely restricted to node-level denoising, which effectively captures local atomic environments but is often insufficient for encoding the global molecular structure critical to graph-level property prediction tasks such as energy estimation and molecular regression. To address this gap, we introduce GeoRecon, a graph-level pretraining framework that shifts the focus from individual atoms to the molecule as an integrated whole. GeoRecon formulates a graph-level reconstruction task: during pretraining, the model is trained to produce an informative graph representation that guides geometry reconstruction while inducing smoother and more transferable latent spaces. This encourages the learning of coherent, global structural features beyond isolated atomic details. Without relying on external supervision, GeoRecon generally improves over backbone baselines on multiple molecular benchmarks including QM9, MD17, MD22, and 3BPA, demonstrating the effectiveness of graph-level reconstruction for holistic and geometry-aware molecular embeddings.
♻ ☆ SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB
Enhancing scene understanding in adverse visibility conditions remains a critical challenge for surveillance and autonomous navigation systems. Conventional imaging modalities, such as RGB and thermal infrared (MWIR / LWIR), when fused, often struggle to deliver comprehensive scene information, particularly under conditions of atmospheric interference or inadequate illumination. To address these limitations, Short-Wave Infrared (SWIR) imaging has emerged as a promising modality due to its ability to penetrate atmospheric disturbances and differentiate materials with improved clarity. However, the advancement and widespread implementation of SWIR-based systems face significant hurdles, primarily due to the scarcity of publicly accessible SWIR datasets. In response to this challenge, our research introduces an approach to synthetically generate SWIR-like structural/contrast cues (without claiming spectral reproduction) images from existing LWIR data using advanced contrast enhancement techniques. We then propose a multimodal fusion framework integrating synthetic SWIR, LWIR, and RGB modalities, employing an optimized encoder-decoder neural network architecture with modality-specific encoders and a softmax-gated fusion head. Comprehensive experiments on public RGB-LWIR benchmarks (M3FD, TNO, CAMEL, MSRS, RoadScene) and an additional private real RGB-MWIR-SWIR dataset demonstrate that our synthetic-SWIR-enhanced fusion framework improves fused-image quality (contrast, edge definition, structural fidelity) while maintaining real-time performance. We also add fair trimodal baselines (LP, LatLRR, GFF) and cascaded trimodal variants of U2Fusion/SwinFusion under a unified protocol. The outcomes highlight substantial potential for real-world applications in surveillance and autonomous systems.
♻ ☆ Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data
This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM's statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM's zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems.
♻ ☆ PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries NeurIPS 2025
LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism--decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation. To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation. By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Datasets and Benchmarks Track
♻ ☆ VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture
In modern smart agriculture, object detection plays a crucial role by enabling automation, precision farming, and monitoring of resources. From identifying crop health and pest infestations to optimizing harvesting processes, accurate object detection enhances both productivity and sustainability. However, training object detection models often requires large-scale data collection and raises privacy concerns, particularly when sensitive agricultural data is distributed across farms. To address these challenges, we propose VLLFL, a vision-language model-based lightweight federated learning framework (VLLFL). It harnesses the generalization and context-aware detection capabilities of the vision-language model (VLM) and leverages the privacy-preserving nature of federated learning. By training a compact prompt generator to boost the performance of the VLM deployed across different farms, VLLFL preserves privacy while reducing communication overhead. Experimental results demonstrate that VLLFL achieves 14.53% improvement in the performance of VLM while reducing 99.3% communication overhead. Spanning tasks from identifying a wide variety of fruits to detecting harmful animals in agriculture, the proposed framework offers an efficient, scalable, and privacy-preserving solution specifically tailored to agricultural applications.
♻ ☆ Steering Generative Models with Experimental Data for Protein Fitness Optimization NeurIPS 2025
Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
comment: NeurIPS 2025
♻ ☆ The Impact of Coreset Selection on Spurious Correlations and Group Robustness
Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, as many datasets suffer from biases that cause models to learn spurious correlations instead of causal features, it is important to understand whether and how dataset reduction methods may perpetuate, amplify, or mitigate these biases. In this work, we conduct the first comprehensive analysis of the implications of data selection on the spurious bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlations benchmarks, five score metrics to characterize sample importance/ difficulty, and five data selection policies across a broad range of coreset sizes. Thereby, we unravel a series of nontrivial nuances in interactions between sample difficulty and bias alignment, as well as dataset bias and resultant model robustness. For example, we find that selecting coresets using embedding-based sample characterization scores runs a comparatively lower risk of inadvertently exacerbating bias than selecting using characterizations based on learning dynamics. Most importantly, our analysis reveals that although some coreset selection methods could achieve lower bias levels by prioritizing difficult samples, they do not reliably guarantee downstream robustness.
comment: 10 pages, 9 additional pages for Appendix
♻ ☆ Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving NeurIPS 2025
Increasing demand for Large Language Models (LLMs) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing. We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55$\times$ in overall performance, 1.85$\times$ in cost efficiency, and nearly 4.25$\times$ in throughput. Our code is available at https://github.com/fzwark/PORT.
comment: NeurIPS 2025
♻ ☆ One protein is all you need
Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model's capacity to excel on any specific one, whereas experimentalists typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. Through two challenging case studies, we also show that customization via ProteinTTT achieves more accurate antibody-antigen loop modeling and enhances 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general-purpose AlphaFold2 and ESMFold struggle.
♻ ☆ FlySearch: Exploring how vision-language models explore NeurIPS 2025
The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.
comment: NeurIPS 2025 Datasets and Benchmarks track
♻ ☆ Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
The comparison between discriminative and generative classifiers has intrigued researchers since Efron's seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures - Auto-regressive modeling, Masked Language Modeling, Discrete Diffusion, and Encoders for text classification. Our study reveals that the classical 'two regimes' phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.
comment: 23 pages - Nominated for Outstanding Paper award at EMNLP 2025
♻ ☆ AICO: Feature Significance Tests for Supervised Learning
The opacity of many supervised learning algorithms remains a key challenge, hindering scientific discovery and limiting broader deployment--particularly in high-stakes domains. This paper develops model- and distribution-agnostic significance tests to assess the influence of input features in any regression or classification algorithm. Our method evaluates a feature's incremental contribution to model performance by masking its values across samples. Under the null hypothesis, the distribution of performance differences across a test set has a non-positive median. We construct a uniformly most powerful, randomized sign test for this median, yielding exact p-values for assessing feature significance and confidence intervals with exact coverage for estimating population-level feature importance. The approach requires minimal assumptions, avoids model retraining or auxiliary models, and remains computationally efficient even for large-scale, high-dimensional settings. Experiments on synthetic tasks validate its statistical and computational advantages, and applications to real-world data demonstrate its ability to inform high-stakes decisions with legal, economic, and regulatory implications.
♻ ☆ GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection NeurIPS 2025
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Sim2Dust: Mastering Dynamic Waypoint Tracking on Granular Media
Reliable autonomous navigation across the unstructured terrains of distant planetary surfaces is a critical enabler for future space exploration. However, the deployment of learning-based controllers is hindered by the inherent sim-to-real gap, particularly for the complex dynamics of wheel interactions with granular media. This work presents a complete sim-to-real framework for developing and validating robust control policies for dynamic waypoint tracking on such challenging surfaces. We leverage massively parallel simulation to train reinforcement learning agents across a vast distribution of procedurally generated environments with randomized physics. These policies are then transferred zero-shot to a physical wheeled rover operating in a lunar-analogue facility. Our experiments systematically compare multiple reinforcement learning algorithms and action smoothing filters to identify the most effective combinations for real-world deployment. Crucially, we provide strong empirical evidence that agents trained with procedural diversity achieve superior zero-shot performance compared to those trained on static scenarios. We also analyze the trade-offs of fine-tuning with high-fidelity particle physics, which offers minor gains in low-speed precision at a significant computational cost. Together, these contributions establish a validated workflow for creating reliable learning-based navigation systems, marking a substantial step towards deploying autonomous robots in the final frontier.
comment: Accepted for publication at the 2025 International Conference on Space Robotics (iSpaRo) | The source code is available at https://github.com/AndrejOrsula/space_robotics_bench
♻ ☆ CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs
This paper is the second in a series of studies on developing efficient artificial intelligence-based approaches to pathfinding on extremely large graphs (e.g. $10^{70}$ nodes) with a focus on Cayley graphs and mathematical applications. The open-source CayleyPy project is a central component of our research. The present paper proposes a novel combination of a reinforcement learning approach with a more direct diffusion distance approach from the first paper. Our analysis includes benchmarking various choices for the key building blocks of the approach: architectures of the neural network, generators for the random walks and beam search pathfinding. We compared these methods against the classical computer algebra system GAP, demonstrating that they "overcome the GAP" for the considered examples. As a particular mathematical application we examine the Cayley graph of the symmetric group with cyclic shift and transposition generators. We provide strong support for the OEIS-A186783 conjecture that the diameter is equal to n(n-1)/2 by machine learning and mathematical methods. We identify the conjectured longest element and generate its decomposition of the desired length. We prove a diameter lower bound of n(n-1)/2-n/2 and an upper bound of n(n-1)/2+ 3n by presenting the algorithm with given complexity. We also present several conjectures motivated by numerical experiments, including observations on the central limit phenomenon (with growth approximated by a Gumbel distribution), the uniform distribution for the spectrum of the graph, and a numerical study of sorting networks. To stimulate crowdsourcing activity, we create challenges on the Kaggle platform and invite contributions to improve and benchmark approaches on Cayley graph pathfinding and other tasks.
comment: 32+16 pages
♻ ☆ Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems
We introduce the Strategic Doubly Robust (SDR) estimator, a novel framework that integrates strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments. SDR addresses endogenous treatment assignment arising from strategic agent behavior, maintaining double robustness while incorporating strategic considerations. Theoretical analysis confirms SDR's consistency and asymptotic normality under strategic unconfoundedness. Empirical evaluations demonstrate SDR's superior performance over baseline methods, achieving 7.6\%-29.3\% bias reduction across varying strategic strengths and maintaining robust scalability with agent populations. The framework provides a principled approach for reliable causal inference when agents respond strategically to interventions.
♻ ☆ A novel Information-Driven Strategy for Optimal Regression Assessment
In Machine Learning (ML), a regression algorithm aims to minimize a loss function based on data. An assessment method in this context seeks to quantify the discrepancy between the optimal response for an input-output system and the estimate produced by a learned predictive model (the student). Evaluating the quality of a learned regressor remains challenging without access to the true data-generating mechanism, as no data-driven assessment method can ensure the achievability of global optimality. This work introduces the Information Teacher, a novel data-driven framework for evaluating regression algorithms with formal performance guarantees to assess global optimality. Our novel approach builds on estimating the Shannon mutual information (MI) between the input variables and the residuals and applies to a broad class of additive noise models. Through numerical experiments, we confirm that the Information Teacher is capable of detecting global optimality, which is aligned with the condition of zero estimation error with respect to the -- inaccessible, in practice -- true model, working as a surrogate measure of the ground truth assessment loss and offering a principled alternative to conventional empirical performance metrics.
♻ ☆ Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks NeurIPS 2025
Nonlinear activation functions are widely recognized for enhancing the expressivity of neural networks, which is the primary reason for their widespread implementation. In this work, we focus on ReLU activation and reveal a novel and intriguing property of nonlinear activations. By comparing enabling and disabling the nonlinear activations in the neural network, we demonstrate their specific effects on wide neural networks: (a) better feature separation, i.e., a larger angle separation for similar data in the feature space of model gradient, and (b) better NTK conditioning, i.e., a smaller condition number of neural tangent kernel (NTK). Furthermore, we show that the network depth (i.e., with more nonlinear activation operations) further amplifies these effects; in addition, in the infinite-width-then-depth limit, all data are equally separated with a fixed angle in the model gradient feature space, regardless of how similar they are originally in the input space. Note that, without the nonlinear activation, i.e., in a linear neural network, the data separation remains the same as for the original inputs and NTK condition number is equivalent to the Gram matrix, regardless of the network depth. Due to the close connection between NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient based methods.
comment: NeurIPS 2025
♻ ☆ Modeling Human Beliefs about AI Behavior for Scalable Oversight
As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators' beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce "belief model covering" as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators' beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.
comment: 56 pages
♻ ☆ Enabling Automatic Differentiation with Mollified Graph Neural Operators
Physics-informed neural operators offer a powerful framework for learning solution operators of partial differential equations (PDEs) by combining data and physics losses. However, these physics losses rely on derivatives. Computing these derivatives remains challenging, with spectral and finite difference methods introducing approximation errors due to finite resolution. Here, we propose the mollified graph neural operator ($m$GNO), the first method to leverage automatic differentiation and compute exact gradients on arbitrary geometries. This enhancement enables efficient training on irregular grids and varying geometries while allowing seamless evaluation of physics losses at randomly sampled points for improved generalization. For a PDE example on regular grids, $m$GNO paired with autograd reduced the L2 relative data error by 20x compared to finite differences, although training was slower. It can also solve PDEs on unstructured point clouds seamlessly, using physics losses only, at resolutions vastly lower than those needed for finite differences to be accurate enough. On these unstructured point clouds, $m$GNO leads to errors that are consistently 2 orders of magnitude lower than machine learning baselines (Meta-PDE, which accelerates PINNs) for comparable runtimes, and also delivers speedups from 1 to 3 orders of magnitude compared to the numerical solver for similar accuracy. $m$GNOs can also be used to solve inverse design and shape optimization problems on complex geometries.
♻ ☆ REOrdering Patches Improves Vision Models NeurIPS 2025
Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
comment: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks NeurIPS
The grand goal of AI research, and particularly Self Supervised Learning (SSL), is to produce systems that can successfully solve any possible task. In contrast, current evaluation methods available to AI researchers typically rely on a fixed collection of hand-picked downstream benchmarks. Hence, a large amount of effort is put into designing and searching for large collection of evaluation tasks that can serve as a proxy of our grand goal. We argue that such a rigid evaluation protocol creates a silent bottleneck in AI research. To remedy that, we define a probabilistic space of downstream tasks obtained by adopting a distribution of tasks and by defining Task Priors. Under this view, one can evaluate a model's performance over the set of all possible downstream tasks. Our framework is the first to provide answers to key questions such as (i) what is the average performance of my model over all possible downstream tasks weighted by the probability to encounter each task? or (ii) what is the variance of my model's performance across all downstream tasks under the defined Task Priors? Beyond establishing a new standard for evaluation, we believe that Task Priors will accelerate the pace of research in SSL - where downstream task evaluation is the sole qualitative signal that researchers have access to.
comment: NeurIPS UniReps Workshop 2025
♻ ☆ LLM Safety Alignment is Divergence Estimation in Disguise NeurIPS 2025
We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
comment: Accepted to NeurIPS 2025
♻ ☆ Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling NeurIPS 2025
Current state-of-the-art generative models map noise to data distributions by matching flows or scores. A key limitation of these models is their inability to readily integrate available partial observations and additional priors. In contrast, energy-based models (EBMs) address this by incorporating corresponding scalar energy terms. Here, we propose Energy Matching, a framework that endows flow-based approaches with the flexibility of EBMs. Far from the data manifold, samples move from noise to data along irrotational, optimal transport paths. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize these dynamics with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. The present method substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in terms of fidelity, while retaining simulation-free training of transport-based approaches away from the data manifold. Furthermore, we leverage the flexibility of the method to introduce an interaction energy that supports the exploration of diverse modes, which we demonstrate in a controlled protein generation setting. This approach learns a scalar potential energy, without time conditioning, auxiliary generators, or additional networks, marking a significant departure from recent EBM methods. We believe this simplified yet rigorous formulation significantly advances EBMs capabilities and paves the way for their wider adoption in generative modeling in diverse domains.
comment: Accepted to NeurIPS 2025
♻ ☆ Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. hile Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren't universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model's OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
♻ ☆ Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses ("if-then" rules) built from features extracted from the LLM's latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules' soundness levels, becoming highly distinct for "strict" versus "noisy" rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL's predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model's reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model's internal mechanisms for selecting/designing stronger base models.
comment: Pre-print
♻ ☆ Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation
Robot learning of manipulation skills is hindered by the scarcity of diverse, unbiased datasets. While curated datasets can help, challenges remain in generalizability and real-world transfer. Meanwhile, large-scale "in-the-wild" video datasets have driven progress in computer vision through self-supervised techniques. Translating this to robotics, recent works have explored learning manipulation skills by passively watching abundant videos sourced online. Showing promising results, such video-based learning paradigms provide scalable supervision while reducing dataset bias. This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources, as well as emerging techniques for acquiring robot manipulation skills from uncontrolled video demonstrations. We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation. The survey summarizes video-based learning approaches, analyses their benefits over standard datasets, survey metrics, and benchmarks, and discusses open challenges and future directions in this nascent domain at the intersection of computer vision, natural language processing, and robot learning.
comment: Published at IEEE Access
♻ ☆ Automotive Crash Dynamics Modeling Accelerated with Machine Learning
Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
♻ ☆ $\boldsymbolλ$-Orthogonality Regularization for Compatible Representation Learning NeurIPS2025
Retrieval systems rely on representations learned by increasingly powerful models. However, due to the high training cost and inconsistencies in learned representations, there is significant interest in facilitating communication between representations and ensuring compatibility across independently trained neural networks. In the literature, two primary approaches are commonly used to adapt different learned representations: affine transformations, which adapt well to specific distributions but can significantly alter the original representation, and orthogonal transformations, which preserve the original structure with strict geometric constraints but limit adaptability. A key challenge is adapting the latent spaces of updated models to align with those of previous models on downstream distributions while preserving the newly learned representation spaces. In this paper, we impose a relaxed orthogonality constraint, namely $\lambda$-Orthogonality regularization, while learning an affine transformation, to obtain distribution-specific adaptation while retaining the original learned representations. Extensive experiments across various architectures and datasets validate our approach, demonstrating that it preserves the model's zero-shot performance and ensures compatibility across model updates. Code available at: \href{https://github.com/miccunifi/lambda_orthogonality.git}{https://github.com/miccunifi/lambda\_orthogonality}.
comment: Accepted at NeurIPS2025
♻ ☆ Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification
Group Relative Policy Optimization (GRPO) was introduced and used recently for promoting reasoning in LLMs under verifiable (binary) rewards. We show that the mean + variance calibration of these rewards induces a weighted contrastive loss in which the contrastive samples are synthetic data drawn from the previous policy. While GRPO was originally paired with clipping to keep updates near the old policy, we analyze variants that differ in reward normalization (mean-only vs mean + variance) and in how they regularize updates using KL divergence: either penalizing divergence from the previous model (mirror), penalizing divergence from a fixed reference model $\pi_{\mathrm{ref}}$, or combining both forms of regularization. For each, the optimal policy $\pi_n$ admits an explicit form in terms of the binary reward and the first and second order statistics of the reward under $\pi_{n-1}$, as well as the policies $\pi_{n-1}$ and $\pi_{\mathrm{ref}}$. Iterating results in a sequence $\{\pi_n\}$ whose probability of success (PoS) obeys a simple recurrence that converges to a fixed point determined by the reference PoS and the regularization strength. We further show that this fixed point exceeds the reference, demonstrating that GRPO amplifies the policy's probability of success.
♻ ☆ Identification and Adaptive Control of Markov Jump Systems: Sample Complexity and Regret Bounds
Learning how to effectively control unknown dynamical systems is crucial for intelligent autonomous systems. This task becomes a significant challenge when the underlying dynamics are changing with time. Motivated by this challenge, this paper considers the problem of controlling an unknown Markov jump linear system (MJS) to optimize a quadratic objective. By taking a model-based perspective, we consider identification-based adaptive control of MJSs. We first provide a system identification algorithm for MJS to learn the dynamics in each mode as well as the Markov transition matrix, underlying the evolution of the mode switches, from a single trajectory of the system states, inputs, and modes. Through martingale-based arguments, sample complexity of this algorithm is shown to be $\mathcal{O}(1/\sqrt{T})$. We then propose an adaptive control scheme that performs system identification together with certainty equivalent control to adapt the controllers in an episodic fashion. Combining our sample complexity results with recent perturbation results for certainty equivalent control, we prove that when the episode lengths are appropriately chosen, the proposed adaptive control scheme achieves $\mathcal{O}(\sqrt{T})$ regret, which can be improved to $\mathcal{O}(polylog(T))$ with partial knowledge of the system. Our proof strategy introduces innovations to handle Markovian jumps and a weaker notion of stability common in MJSs. Our analysis provides insights into system theoretic quantities that affect learning accuracy and control performance. Numerical simulations are presented to further reinforce these insights.
comment: Improved results using Martingale-based arguments
♻ ☆ AICO: Feature Significance Tests for Supervised Learning
The opacity of many supervised learning algorithms remains a key challenge, hindering scientific discovery and limiting broader deployment -- particularly in high-stakes domains. This paper develops model- and distribution-agnostic significance tests to assess the influence of input features in any regression or classification algorithm. Our method evaluates a feature's incremental contribution to model performance by masking its values across samples. Under the null hypothesis, the distribution of performance differences across a test set has a non-positive median. We construct a uniformly most powerful, randomized sign test for this median, yielding exact p-values for assessing feature significance and confidence intervals with exact coverage for estimating population-level feature importance. The approach requires minimal assumptions, avoids model retraining or auxiliary models, and remains computationally efficient even for large-scale, high-dimensional settings. Experiments on synthetic tasks validate its statistical and computational advantages, and applications to real-world data demonstrate its ability to inform high-stakes decisions with legal, economic, and regulatory implications.
Computer Vision and Pattern Recognition 85
☆ How Universal Are SAM2 Features?
The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.
comment: This work has been accepted for publication in IEEE Picture Coding Symposium (PCS) 2025
☆ Video Reasoning without Training
Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
☆ Person Re-Identification via Generalized Class Prototypes
Advanced feature extraction methods have significantly contributed to enhancing the task of person re-identification. In addition, modifications to objective functions have been developed to further improve performance. Nonetheless, selecting better class representatives is an underexplored area of research that can also lead to advancements in re-identification performance. Although past works have experimented with using the centroid of a gallery image class during training, only a few have investigated alternative representations during the retrieval stage. In this paper, we demonstrate that these prior techniques yield suboptimal results in terms of re-identification metrics. To address the re-identification problem, we propose a generalized selection method that involves choosing representations that are not limited to class centroids. Our approach strikes a balance between accuracy and mean average precision, leading to improvements beyond the state of the art. For example, the actual number of representations per class can be adjusted to meet specific application requirements. We apply our methodology on top of multiple re-identification embeddings, and in all cases it substantially improves upon contemporary results
comment: 18 pages, 11 figures, and 4 tables
☆ Click, Predict, Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework
Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.
comment: 13 pages, 2 figures, and 2 tables
☆ DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation
Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature, requiring continuous physician input and lacking intelligent autonomy. This dependency contributes to operator fatigue, more radiation exposure, and variability in procedural outcomes. This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Actions are predicted autoregressively from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline while additionally grounding predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation, representing an important step toward reducing operator dependency and improving the reliability of catheterbased therapies.
☆ Conditional Synthetic Live and Spoof Fingerprint Generation
Large fingerprint datasets, while important for training and evaluation, are time-consuming and expensive to collect and require strict privacy measures. Researchers are exploring the use of synthetic fingerprint data to address these issues. This paper presents a novel approach for generating synthetic fingerprint images (both spoof and live), addressing concerns related to privacy, cost, and accessibility in biometric data collection. Our approach utilizes conditional StyleGAN2-ADA and StyleGAN3 architectures to produce high-resolution synthetic live fingerprints, conditioned on specific finger identities (thumb through little finger). Additionally, we employ CycleGANs to translate these into realistic spoof fingerprints, simulating a variety of presentation attack materials (e.g., EcoFlex, Play-Doh). These synthetic spoof fingerprints are crucial for developing robust spoof detection systems. Through these generative models, we created two synthetic datasets (DB2 and DB3), each containing 1,500 fingerprint images of all ten fingers with multiple impressions per finger, and including corresponding spoofs in eight material types. The results indicate robust performance: our StyleGAN3 model achieves a Fr\'echet Inception Distance (FID) as low as 5, and the generated fingerprints achieve a True Accept Rate of 99.47% at a 0.01% False Accept Rate. The StyleGAN2-ADA model achieved a TAR of 98.67% at the same 0.01% FAR. We assess fingerprint quality using standard metrics (NFIQ2, MINDTCT), and notably, matching experiments confirm strong privacy preservation, with no significant evidence of identity leakage, confirming the strong privacy-preserving properties of our synthetic datasets.
☆ Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding
Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.
☆ Enrich and Detect: Video Temporal Grounding with Multimodal LLMs ICCV 2025
We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes at predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.
comment: ICCV 2025 (Highlights)
☆ Do Satellite Tasks Need Special Pretraining?
Foundation models have advanced machine learning across various modalities, including images. Recently multiple teams trained foundation models specialized for remote sensing applications. This line of research is motivated by the distinct characteristics of remote sensing imagery, specific applications and types of robustness useful for satellite image analysis. In this work we systematically challenge the idea that specific foundation models are more useful than general-purpose vision foundation models, at least in the small scale. First, we design a simple benchmark that measures generalization of remote sensing models towards images with lower resolution for two downstream tasks. Second, we train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several modifications specific to remote sensing. We show that none of those pretrained models bring consistent improvements upon general-purpose baselines at the ViT-B scale.
☆ An empirical study of the effect of video encoders on Temporal Video Grounding
Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.
☆ Training-free Online Video Step Grounding NeurIPS 2025
Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.
comment: NeurIPS 2025. Project website at https://lucazanella.github.io/baglm/
☆ CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
☆ One-step Diffusion Models with Bregman Density Ratio Matching
Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.
comment: work in progress
☆ Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis
Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting. Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.
☆ Unlocking Off-the-Grid Sparse Recovery with Unlimited Sensing: Simultaneous Super-Resolution in Time and Amplitude
The recovery of Dirac impulses, or spikes, from filtered measurements is a classical problem in signal processing. As the spikes lie in the continuous domain while measurements are discrete, this task is known as super-resolution or off-the-grid sparse recovery. Despite significant theoretical and algorithmic advances over the past decade, these developments often overlook critical challenges at the analog-digital interface. In particular, when spikes exhibit strong-weak amplitude disparity, conventional digital acquisition may result in clipping of strong components or loss of weak ones beneath the quantization noise floor. This motivates a broader perspective: super-resolution must simultaneously resolve both amplitude and temporal structure. Under a fixed bit budget, such information loss is unavoidable. In contrast, the emerging theory and practice of the Unlimited Sensing Framework (USF) demonstrate that these fundamental limitations can be overcome. Building on this foundation, we demonstrate that modulo encoding within USF enables digital super-resolution by enhancing measurement precision, thereby unlocking temporal super-resolution beyond conventional limits. We develop new theoretical results that extend to non-bandlimited kernels commonly encountered in practice and introduce a robust algorithm for off-the-grid sparse recovery. To demonstrate practical impact, we instantiate our framework in the context of time-of-flight imaging. Both numerical simulations and hardware experiments validate the effectiveness of our approach under low-bit quantization, enabling super-resolution in amplitude and time.
comment: 28 Pages, 10 figures. To appear in IEEE Journal of Selected Topics in Signal Processing
☆ Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
comment: 23 pages,19 figures
☆ Domain Generalizable Continual Learning
To adapt effectively to dynamic real-world environments, intelligent systems must continually acquire new skills while generalizing them to diverse, unseen scenarios. Here, we introduce a novel and realistic setting named domain generalizable continual learning (DGCL): a model learns sequential tasks with each involving a single domain, aiming to perform well across all encountered tasks and domains. This setting poses unique challenges in acquiring, retaining, and leveraging both semantic- and domain-relevant information for robust generalization. Although state-of-the-art continual learning (CL) methods have employed pre-trained models (PTMs) to enhance task-specific generalization, they typically assume identical training and testing domains for each task and therefore perform poorly in DGCL. To this end, we propose adaptive Domain Transformation (DoT), an innovative PTMs-based approach tailored to DGCL. Inspired by the distributed-plus-hub theory of the human brain, DoT disentangles semantic- and domain-relevant information in representation learning, and adaptively transforms task representations across various domains for output alignment, ensuring balanced and generalized predictions. DoT serves as a plug-in strategy that greatly facilitates state-of-the-art CL baselines under both full parameter tuning and parameter-efficient tuning paradigms in DGCL, validated by extensive experiments. Also, DoT is shown to accumulate domain-generalizable knowledge from DGCL, and ensure resource efficiency with a lightweight implementation.
comment: 25 pages
☆ Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation
Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under lowlight and visually obscured conditions where RGB-based systems fail. While convolutional neural networks (CNNs) dominate thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited. Vision Transformers (ViTs), with their global context modeling capabilities, have achieved state-of-the-art results in RGB segmentation tasks, yet their potential in thermal weapon segmentation remains underexplored. This work adapts and evaluates four transformer-based architectures SegFormer, DeepLabV3\+, SegNeXt, and Swin Transformer for binary weapon segmentation on a custom thermal dataset comprising 9,711 images collected from real world surveillance videos and automatically annotated using SAM2. We employ standard augmentation strategies within the MMSegmentation framework to ensure robust model training and fair architectural comparison. Experimental results demonstrate significant improvements in segmentation performance: SegFormer-b5 achieves the highest mIoU (94.15\%) and Pixel Accuracy (97.04\%), while SegFormer-b0 provides the fastest inference speed (98.32 FPS) with competitive mIoU (90.84\%). SegNeXt-mscans offers balanced performance with 85.12 FPS and 92.24\% mIoU, and DeepLabV3\+ R101-D8 reaches 92.76\% mIoU at 29.86 FPS. The transformer architectures demonstrate robust generalization capabilities for weapon detection in low-light and occluded thermal environments, with flexible accuracy-speed trade-offs suitable for diverse real-time security applications.
comment: 9 Images with 1 figure and 3 Tables. This is a preprint submitted to arXiv
☆ Contrail-to-Flight Attribution Using Ground Visible Cameras and Flight Surveillance Data
Aviation's non-CO2 effects, particularly contrails, are a significant contributor to its climate impact. Persistent contrails can evolve into cirrus-like clouds that trap outgoing infrared radiation, with radiative forcing potentially comparable to or exceeding that of aviation's CO2 emissions. While physical models simulate contrail formation, evolution and dissipation, validating and calibrating these models requires linking observed contrails to the flights that generated them, a process known as contrail-to-flight attribution. Satellite-based attribution is challenging due to limited spatial and temporal resolution, as contrails often drift and deform before detection. In this paper, we evaluate an alternative approach using ground-based cameras, which capture contrails shortly after formation at high spatial and temporal resolution, when they remain thin, linear, and visually distinct. Leveraging the ground visible camera contrail sequences (GVCCS) dataset, we introduce a modular framework for attributing contrails observed using ground-based cameras to theoretical contrails derived from aircraft surveillance and meteorological data. The framework accommodates multiple geometric representations and distance metrics, incorporates temporal smoothing, and enables flexible probability-based assignment strategies. This work establishes a strong baseline and provides a modular framework for future research in linking contrails to their source flight.
☆ Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
☆ Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis
Generative models, especially Diffusion Models, have demonstrated remarkable capability in generating high-quality synthetic data, including medical images. However, traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, limiting their usefulness for applications such as skin cancer diagnosis. To address this problem, we propose a classification-induced diffusion model, namely, Class-N-Diff, to simultaneously generate and classify dermoscopic images. Our Class-N-Diff model integrates a classifier within a diffusion model to guide image generation based on its class conditions. Thus, the model has better control over class-conditioned image synthesis, resulting in more realistic and diverse images. Additionally, the classifier demonstrates improved performance, highlighting its effectiveness for downstream diagnostic tasks. This unique integration in our Class-N-Diff makes it a robust tool for enhancing the quality and utility of diffusion model-based synthetic dermoscopic image generation. Our code is available at https://github.com/Munia03/Class-N-Diff.
comment: EMBC 2025
☆ Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning
Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL's effectiveness in addressing this challenge through a biologically inspired design. Code is available at https://github.com/gfyddha/Fly-CL.
☆ Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding
While brain-inspired artificial intelligence(AI) has demonstrated promising results, current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited: (1) unimodal ANN studies fail to capture the brain's inherent multimodal processing capabilities, and (2) multimodal ANN research primarily focuses on high-level model outputs, neglecting the crucial role of individual neurons. To address these limitations, we propose a novel neuron-level analysis framework that investigates the multimodal information processing mechanisms in vision-language models (VLMs) through the lens of human brain activity. Our approach uniquely combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to examine two architecturally distinct VLMs: CLIP and METER. Our analysis reveals four key findings: (1) ANs successfully predict biological neurons (BNs) activities across multiple functional networks (including language, vision, attention, and default mode), demonstrating shared representational mechanisms; (2) Both ANs and BNs demonstrate functional redundancy through overlapping neural representations, mirroring the brain's fault-tolerant and collaborative information processing mechanisms; (3) ANs exhibit polarity patterns that parallel the BNs, with oppositely activated BNs showing mirrored activation trends across VLM layers, reflecting the complexity and bidirectional nature of neural information processing; (4) The architectures of CLIP and METER drive distinct BNs: CLIP's independent branches show modality-specific specialization, whereas METER's cross-modal design yields unified cross-modal activation, highlighting the architecture's influence on ANN brain-like properties. These results provide compelling evidence for brain-like hierarchical processing in VLMs at the neuronal level.
comment: 14 pages, 7 figures
Registration is a Powerful Rotation-Invariance Learner for 3D Anomaly Detection
3D anomaly detection in point-cloud data is critical for industrial quality control, aiming to identify structural defects with high reliability. However, current memory bank-based methods often suffer from inconsistent feature transformations and limited discriminative capacity, particularly in capturing local geometric details and achieving rotation invariance. These limitations become more pronounced when registration fails, leading to unreliable detection results. We argue that point-cloud registration plays an essential role not only in aligning geometric structures but also in guiding feature extraction toward rotation-invariant and locally discriminative representations. To this end, we propose a registration-induced, rotation-invariant feature extraction framework that integrates the objectives of point-cloud registration and memory-based anomaly detection. Our key insight is that both tasks rely on modeling local geometric structures and leveraging feature similarity across samples. By embedding feature extraction into the registration learning process, our framework jointly optimizes alignment and representation learning. This integration enables the network to acquire features that are both robust to rotations and highly effective for anomaly detection. Extensive experiments on the Anomaly-ShapeNet and Real3D-AD datasets demonstrate that our method consistently outperforms existing approaches in effectiveness and generalizability.
☆ BARL: Bilateral Alignment in Representation and Label Spaces for Semi-Supervised Volumetric Medical Image Segmentation
Semi-supervised medical image segmentation (SSMIS) seeks to match fully supervised performance while sharply reducing annotation cost. Mainstream SSMIS methods rely on \emph{label-space consistency}, yet they overlook the equally critical \emph{representation-space alignment}. Without harmonizing latent features, models struggle to learn representations that are both discriminative and spatially coherent. To this end, we introduce \textbf{Bilateral Alignment in Representation and Label spaces (BARL)}, a unified framework that couples two collaborative branches and enforces alignment in both spaces. For label-space alignment, inspired by co-training and multi-scale decoding, we devise \textbf{Dual-Path Regularization (DPR)} and \textbf{Progressively Cognitive Bias Correction (PCBC)} to impose fine-grained cross-branch consistency while mitigating error accumulation from coarse to fine scales. For representation-space alignment, we conduct region-level and lesion-instance matching between branches, explicitly capturing the fragmented, complex pathological patterns common in medical imagery. Extensive experiments on four public benchmarks and a proprietary CBCT dataset demonstrate that BARL consistently surpasses state-of-the-art SSMIS methods. Ablative studies further validate the contribution of each component. Code will be released soon.
comment: 14 pages, 5 figures
☆ ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines CBAM-enhanced encoder backbone with attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.
comment: 9 pages with 4 figures and 5 tables. This is a preprint submitted to arXiv
☆ 2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting
Recent advancements in 3D Gaussian Splatting (3DGS) have greatly influenced neural fields, as it enables high-fidelity rendering with impressive visual quality. However, 3DGS has difficulty accurately representing surfaces. In contrast, 2DGS transforms the 3D volume into a collection of 2D planar Gaussian disks. Despite advancements in geometric fidelity, rendering quality remains compromised, highlighting the challenge of achieving both high-quality rendering and precise geometric structures. This indicates that optimizing both geometric and rendering quality in a single training stage is currently unfeasible. To overcome this limitation, we present 2DGS-R, a new method that uses a hierarchical training approach to improve rendering quality while maintaining geometric accuracy. 2DGS-R first trains the original 2D Gaussians with the normal consistency regularization. Then 2DGS-R selects the 2D Gaussians with inadequate rendering quality and applies a novel in-place cloning operation to enhance the 2D Gaussians. Finally, we fine-tune the 2DGS-R model with opacity frozen. Experimental results show that compared to the original 2DGS, our method requires only 1\% more storage and minimal additional training time. Despite this negligible overhead, it achieves high-quality rendering results while preserving fine geometric structures. These findings indicate that our approach effectively balances efficiency with performance, leading to improvements in both visual fidelity and geometric reconstruction accuracy.
☆ From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display
Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose-aware and identity-preserving video generation framework that addresses two key challenges: the misalignment between head and body motion, and identity drift caused by temporal modeling. In particular, M2HVideo incorporates a dynamic pose-aware head encoder that fuses facial semantics with body pose to produce consistent identity embeddings across frames. To address the loss of fine facial details due to latent space compression, we introduce a mirror loss applied in pixel space through a denoising diffusion implicit model (DDIM)-based one-step denoising. Additionally, we design a distribution-aware adapter that aligns statistical distributions of identity and clothing features to enhance temporal coherence. Extensive experiments on the UBC fashion dataset, our self-constructed ASOS dataset, and the newly collected MannequinVideos dataset captured on-site demonstrate that M2HVideo achieves superior performance in terms of clothing consistency, identity preservation, and video fidelity in comparison to state-of-the-art methods.
☆ Robust Cross-Domain Adaptation in Texture Features Transferring for Wood Chip Moisture Content Prediction
Accurate and quick prediction of wood chip moisture content is critical for optimizing biofuel production and ensuring energy efficiency. The current widely used direct method (oven drying) is limited by its longer processing time and sample destructiveness. On the other hand, existing indirect methods, including near-infrared spectroscopy-based, electrical capacitance-based, and image-based approaches, are quick but not accurate when wood chips come from various sources. Variability in the source material can alter data distributions, undermining the performance of data-driven models. Therefore, there is a need for a robust approach that effectively mitigates the impact of source variability. Previous studies show that manually extracted texture features have the potential to predict wood chip moisture class. Building on this, in this study, we conduct a comprehensive analysis of five distinct texture feature types extracted from wood chip images to predict moisture content. Our findings reveal that a combined feature set incorporating all five texture features achieves an accuracy of 95% and consistently outperforms individual texture features in predicting moisture content. To ensure robust moisture prediction, we propose a domain adaptation method named AdaptMoist that utilizes the texture features to transfer knowledge from one source of wood chip data to another, addressing variability across different domains. We also proposed a criterion for model saving based on adjusted mutual information. The AdaptMoist method improves prediction accuracy across domains by 23%, achieving an average accuracy of 80%, compared to 57% for non-adapted models. These results highlight the effectiveness of AdaptMoist as a robust solution for wood chip moisture content estimation across domains, making it a potential solution for wood chip-reliant industries.
☆ ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification
Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
☆ Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity
Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental, cultural, and geospatial variables. We address this challenge using a deep learning approach but must contend with structural label scarcity inherent to archaeology: positives are rare, and most locations are unlabeled. To address this, we adopt a semi-supervised, positive-unlabeled (PU) learning strategy, implemented as a semantic segmentation model and evaluated on two datasets covering a representative range of archaeological periods. Our approach employs dynamic pseudolabeling, refined with a Conditional Random Field (CRF) implemented via an RNN, increasing label confidence under severe class imbalance. On a geospatial dataset derived from a digital elevation model (DEM), our model performs on par with the state-of-the-art, LAMAP, while achieving higher Dice scores. On raw satellite imagery, assessed end-to-end with stratified k-fold cross-validation, it maintains performance and yields predictive surfaces with improved interpretability. Overall, our results indicate that semi-supervised learning offers a promising approach to identifying undiscovered sites across large, sparsely annotated landscapes.
☆ An RGB-D Image Dataset for Lychee Detection and Maturity Classification for Robotic Harvesting
Lychee is a high-value subtropical fruit. The adoption of vision-based harvesting robots can significantly improve productivity while reduce reliance on labor. High-quality data are essential for developing such harvesting robots. However, there are currently no consistently and comprehensively annotated open-source lychee datasets featuring fruits in natural growing environments. To address this, we constructed a dataset to facilitate lychee detection and maturity classification. Color (RGB) images were acquired under diverse weather conditions, and at different times of the day, across multiple lychee varieties, such as Nuomici, Feizixiao, Heiye, and Huaizhi. The dataset encompasses three different ripeness stages and contains 11,414 images, consisting of 878 raw RGB images, 8,780 augmented RGB images, and 1,756 depth images. The images are annotated with 9,658 pairs of lables for lychee detection and maturity classification. To improve annotation consistency, three individuals independently labeled the data, and their results were then aggregated and verified by a fourth reviewer. Detailed statistical analyses were done to examine the dataset. Finally, we performed experiments using three representative deep learning models to evaluate the dataset. It is publicly available for academic
☆ Personalized Image Filter: Mastering Your Photographic Style
Photographic style, as a composition of certain photographic concepts, is the charm behind renowned photographers. But learning and transferring photographic style need a profound understanding of how the photo is edited from the unknown original appearance. Previous works either fail to learn meaningful photographic concepts from reference images, or cannot preserve the content of the content image. To tackle these issues, we proposed a Personalized Image Filter (PIF). Based on a pretrained text-to-image diffusion model, the generative prior enables PIF to learn the average appearance of photographic concepts, as well as how to adjust them according to text prompts. PIF then learns the photographic style of reference images with the textual inversion technique, by optimizing the prompts for the photographic concepts. PIF shows outstanding performance in extracting and transferring various kinds of photographic style. Project page: https://pif.pages.dev/
☆ Unsupervised Monocular Road Segmentation for Autonomous Driving via Scene Geometry
This paper presents a fully unsupervised approach for binary road segmentation (road vs. non-road), eliminating the reliance on costly manually labeled datasets. The method leverages scene geometry and temporal cues to distinguish road from non-road regions. Weak labels are first generated from geometric priors, marking pixels above the horizon as non-road and a predefined quadrilateral in front of the vehicle as road. In a refinement stage, temporal consistency is enforced by tracking local feature points across frames and penalizing inconsistent label assignments using mutual information maximization. This enhances both precision and temporal stability. On the Cityscapes dataset, the model achieves an Intersection-over-Union (IoU) of 0.82, demonstrating high accuracy with a simple design. These findings demonstrate the potential of combining geometric constraints and temporal consistency for scalable unsupervised road segmentation in autonomous driving.
comment: 7 pages, 3 figures
Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs
Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.
☆ Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
☆ GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation
Accurate 6D pose estimation of 3D objects is a fundamental task in computer vision, and current research typically predicts the 6D pose by establishing correspondences between 2D image features and 3D model features. However, these methods often face difficulties with textureless objects and varying illumination conditions. To overcome these limitations, we propose GS2POSE, a novel approach for 6D object pose estimation. GS2POSE formulates a pose regression algorithm inspired by the principles of Bundle Adjustment (BA). By leveraging Lie algebra, we extend the capabilities of 3DGS to develop a pose-differentiable rendering pipeline, which iteratively optimizes the pose by comparing the input image to the rendered image. Additionally, GS2POSE updates color parameters within the 3DGS model, enhancing its adaptability to changes in illumination. Compared to previous models, GS2POSE demonstrates accuracy improvements of 1.4\%, 2.8\% and 2.5\% on the T-LESS, LineMod-Occlusion and LineMod datasets, respectively.
☆ EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation
X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens for clinicians and patient wait times. Existing MRG models predominantly rely on Large Language Models (LLMs) to improve report generation, with limited exploration of pre-trained vision foundation models or advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or utilize simplistic methods like LoRA, often neglecting the potential of enhancing cross-attention mechanisms. Additionally, while Transformer-based models dominate vision-language tasks, non-Transformer architectures, such as the Mamba network, remain underexplored for medical report generation, presenting a promising avenue for future research. In this paper, we propose EMRRG, a novel X-ray report generation framework that fine-tunes pre-trained Mamba networks using parameter-efficient methods. Specifically, X-ray images are divided into patches, tokenized, and processed by an SSM-based vision backbone for feature extraction, with Partial LoRA yielding optimal performance. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training and achieving strong results on benchmark datasets. Extensive experiments on three widely used benchmark datasets fully validated the effectiveness of our proposed strategies for the X-ray MRG. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.
☆ Region in Context: Text-condition Image editing with Human-like semantic reasoning
Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: https://github.com/thuyvuphuong/Region-in-Context.git
☆ WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement
Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, which is a novel framework with a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expandding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhancing the model sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.
comment: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Oral
☆ End-to-end Listen, Look, Speak and Act
Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.
comment: 22 pages, 8 figures
☆ Prominence-Aware Artifact Detection and Dataset for Image Super-Resolution
Generative image super-resolution (SR) is rapidly advancing in visual quality and detail restoration. As the capacity of SR models expands, however, so does their tendency to produce artifacts: incorrect, visually disturbing details that reduce perceived quality. Crucially, their perceptual impact varies: some artifacts are barely noticeable while others strongly degrade the image. We argue that artifacts should be characterized by their prominence to human observers rather than treated as uniform binary defects. Motivated by this, we present a novel dataset of 1302 artifact examples from 11 contemporary image-SR methods, where each artifact is paired with a crowdsourced prominence score. Building on this dataset, we train a lightweight regressor that produces spatial prominence heatmaps and outperforms existing methods at detecting prominent artifacts. We release the dataset and code to facilitate prominence-aware evaluation and mitigation of SR artifacts.
☆ Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
☆ A Comprehensive Survey on World Models for Embodied AI
Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. World models serve as internal simulators that capture environment dynamics, enabling forward and counterfactual rollouts to support perception, prediction, and decision making. This survey presents a unified framework for world models in embodied AI. Specifically, we formalize the problem setting and learning objectives, and propose a three-axis taxonomy encompassing: (1) Functionality, Decision-Coupled vs. General-Purpose; (2) Temporal Modeling, Sequential Simulation and Inference vs. Global Difference Prediction; (3) Spatial Representation, Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. We systematize data resources and metrics across robotics, autonomous driving, and general video settings, covering pixel prediction quality, state-level understanding, and task performance. Furthermore, we offer a quantitative comparison of state-of-the-art models and distill key open challenges, including the scarcity of unified datasets and the need for evaluation metrics that assess physical consistency over pixel fidelity, the trade-off between model performance and the computational efficiency required for real-time control, and the core modeling difficulty of achieving long-horizon temporal consistency while mitigating error accumulation. Finally, we maintain a curated bibliography at https://github.com/Li-Zn-H/AwesomeWorldModels.
comment: https://github.com/Li-Zn-H/AwesomeWorldModels
☆ UKANFormer: Noise-Robust Semantic Segmentation for Coral Reef Mapping via a Kolmogorov-Arnold Network-Transformer Hybrid
Coral reefs are vital yet fragile ecosystems that require accurate large-scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distri-bution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine-grained boundary delineation. To address these challenges, we propose UKANFormer, a novel se-mantic segmentation model designed to achieve high-precision mapping under noisy supervision derived from Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global-Local Transformer (GL-Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and sup-port scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.
☆ Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
☆ Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mech- anism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
comment: Project page: https://scenecot.github.io/
☆ HumanCM: One Step Human Motion Prediction
We present HumanCM, a one-step human motion prediction framework built upon consistency models. Instead of relying on multi-step denoising as in diffusion-based methods, HumanCM performs efficient single-step generation by learning a self-consistent mapping between noisy and clean motion states. The framework adopts a Transformer-based spatiotemporal architecture with temporal embeddings to model long-range dependencies and preserve motion coherence. Experiments on Human3.6M and HumanEva-I demonstrate that HumanCM achieves comparable or superior accuracy to state-of-the-art diffusion models while reducing inference steps by up to two orders of magnitude.
comment: 6 pages, 2 figures, 2 tables
☆ Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization KDD 2025
Distribution shifts between training and testing samples frequently occur in practice and impede model generalization performance. This crucial challenge thereby motivates studies on domain generalization (DG), which aim to predict the label on unseen target domain data by solely using data from source domains. It is intuitive to conceive the class-separated representations learned in contrastive learning (CL) are able to improve DG, while the reality is quite the opposite: users observe directly applying CL deteriorates the performance. We analyze the phenomenon with the insights from CL theory and discover lack of intra-class connectivity in the DG setting causes the deficiency. We thus propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance the conceptual connectivity across domains and obtain generalizable representations for DG. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. On the model side, to better embed the unseen test domains, we propose model anchoring to exploit the intra-class connectivity in pre-trained representations and complement the anchoring with generative transformation loss. Extensive experiments on five standard DG benchmarks are performed. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The detailed model implementation and the code are provided through https://github.com/weitianxin/DCCL
comment: Accepted by KDD 2025
☆ SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation
Optical Coherence Tomography (OCT) is a widely used non-invasive imaging technique that provides detailed three-dimensional views of the retina, which are essential for the early and accurate diagnosis of ocular diseases. Consequently, OCT image analysis and processing have emerged as key research areas in biomedical imaging. However, acquiring paired datasets of clean and real-world noisy OCT images for supervised denoising models remains a formidable challenge due to intrinsic speckle noise and practical constraints in clinical imaging environments. To address these issues, we propose SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation. Our novel approach leverages only noisy OCT images by first generating pseudo-ground-truth images through self-fusion and self-supervised denoising. These refined images then serve as targets to train an ensemble of denoising models using a patch-based strategy that effectively enhances image clarity. Performance improvements are validated via metrics such as Contrast-to-Noise Ratio (CNR), Mean Square Ratio (MSR), Texture Preservation (TP), and Edge Preservation (EP) on the real-world dataset from the IEEE SPS Video and Image Processing Cup. Notably, the VIP Cup dataset contains only real-world noisy OCT images without clean references, highlighting our method's potential for improving image quality and diagnostic outcomes in clinical practice.
comment: 2025 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
☆ Pursuing Minimal Sufficiency in Spatial Reasoning
Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from \textit{expert models}. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.
☆ Filtering of Small Components for Isosurface Generation
Let $f: \mathbb{R}^3 \rightarrow \mathbb{R}$ be a scalar field. An isosurface is a piecewise linear approximation of a level set $f^{-1}(\sigma)$ for some $\sigma \in \mathbb{R}$ built from some regular grid sampling of $f$. Isosurfaces constructed from scanned data such as CT scans or MRIs often contain extremely small components that distract from the visualization and do not form part of any geometric model produced from the data. Simple prefiltering of the data can remove such small components while having no effect on the large components that form the body of the visualization. We present experimental results on such filtering.
comment: 8 pages, 6 figures, 5 tables
♻ ☆ Beyond Uncertainty Quantification: Learning Uncertainty for Trust-Informed Neural Network Decisions - A Case Study in COVID-19 Classification
Reliable uncertainty quantification is critical in high-stakes applications, such as medical diagnosis, where confidently incorrect predictions can erode trust in automated decision-making systems. Traditional uncertainty quantification methods rely on a predefined confidence threshold to classify predictions as confident or uncertain. However, this approach assumes that predictions exceeding the threshold are trustworthy, while those below it are uncertain, without explicitly assessing the correctness of high-confidence predictions. As a result, confidently incorrect predictions may still occur, leading to misleading uncertainty assessments. To address this limitation, this study proposed an uncertainty-aware stacked neural network, which extends conventional uncertainty quantification by learning when predictions should be trusted. The framework consists of a two-tier model: the base model generates predictions with uncertainty estimates, while the meta-model learns to assign a trust flag, distinguishing confidently correct cases from those requiring expert review. The proposed approach is evaluated against the traditional threshold-based method across multiple confidence thresholds and pre-trained architectures using the COVIDx CXR-4 dataset. Results demonstrate that the proposed framework significantly reduces confidently incorrect predictions, offering a more trustworthy and efficient decision-support system for high-stakes domains.
comment: 13 pages, 5 figures, 6 tables
♻ ☆ Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling NeurIPS 2025
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 34% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.
comment: Accepted to NeurIPS 2025; Project Page: https://reverse-vlm.github.io
♻ ☆ ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas--the most diverse benchmark of fine-grained actions across multiple sports--where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
comment: Accepted to TMLR 2025 - Project page: https://amir-aghdam.github.io/act-align/
♻ ☆ Robust Pan-Cancer Mitotic Figure Detection with YOLOv12
Mitotic figures represent a key histoprognostic feature in tumor pathology, providing crucial insights into tumor aggressiveness and proliferation. However, their identification remains challenging, subject to significant inter-observer variability, even among experienced pathologists. To address this issue, the MItosis DOmain Generalization (MIDOG) 2025 challenge marks the third edition of an international competition aiming to develop robust mitosis detection algorithms. In this paper, we present a mitotic figure detection approach based on the state-of-the-art YOLOv12 object detection architecture. Our method achieved an F1-score of 0.801 on the preliminary test set (hotspots only) and ranked second on the final test leaderboard with an F1-score of 0.7216 across complex and heterogeneous whole-slide regions, without relying on external data.
♻ ☆ Hierarchical Material Recognition from Local Appearance ICCV 2025
We introduce a taxonomy of materials for hierarchical recognition from local appearance. Our taxonomy is motivated by vision applications and is arranged according to the physical traits of materials. We contribute a diverse, in-the-wild dataset with images and depth maps of the taxonomy classes. Utilizing the taxonomy and dataset, we present a method for hierarchical material recognition based on graph attention networks. Our model leverages the taxonomic proximity between classes and achieves state-of-the-art performance. We demonstrate the model's potential to generalize to adverse, real-world imaging conditions, and that novel views rendered using the depth maps can enhance this capability. Finally, we show the model's capacity to rapidly learn new materials in a few-shot learning setting.
comment: ICCV 2025 Camera Ready
♻ ☆ Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy
In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85\% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
comment: Code is available at https://github.com/krennic999/ARsample
♻ ☆ Enhancing Test Time Adaptation with Few-shot Guidance
Deep neural networks often encounter significant performance drops while facing with domain shifts between training (source) and test (target) data. To address this issue, Test Time Adaptation (TTA) methods have been proposed to adapt pre-trained source model to handle out-of-distribution streaming target data. Although these methods offer some relief, they lack a reliable mechanism for domain shift correction, which can often be erratic in real-world applications. In response, we develop Few-Shot Test Time Adaptation (FS-TTA), a novel and practical setting that utilizes a few-shot support set on top of TTA. Adhering to the principle of few inputs, big gains, FS-TTA reduces blind exploration in unseen target domains. Furthermore, we propose a two-stage framework to tackle FS-TTA, including (i) fine-tuning the pre-trained source model with few-shot support set, along with using feature diversity augmentation module to avoid overfitting, (ii) implementing test time adaptation based on prototype memory bank guidance to produce high quality pseudo-label for model adaptation. Through extensive experiments on three cross-domain classification benchmarks, we demonstrate the superior performance and reliability of our FS-TTA and framework.
comment: 17 pages, 8 figures
♻ ☆ Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering
Osteoporosis is a common condition that increases fracture risk, especially in older adults. Early diagnosis is vital for preventing fractures, reducing treatment costs, and preserving mobility. However, healthcare providers face challenges like limited labeled data and difficulties in processing medical images. This study presents a novel multi-modal learning framework that integrates clinical and imaging data to improve diagnostic accuracy and model interpretability. The model utilizes three pre-trained networks-VGG19, InceptionV3, and ResNet50-to extract deep features from X-ray images. These features are transformed using PCA to reduce dimensionality and focus on the most relevant components. A clustering-based selection process identifies the most representative components, which are then combined with preprocessed clinical data and processed through a fully connected network (FCN) for final classification. A feature importance plot highlights key variables, showing that Medical History, BMI, and Height were the main contributors, emphasizing the significance of patient-specific data. While imaging features were valuable, they had lower importance, indicating that clinical data are crucial for accurate predictions. This framework promotes precise and interpretable predictions, enhancing transparency and building trust in AI-driven diagnoses for clinical integration.
♻ ☆ Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey and Benchmark
Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks, showcasing exceptional performance. However, as these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands. To address these challenges, parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters. This paper presents a comprehensive survey of the latest advancements in the visual PEFT field, systematically reviewing current methodologies and categorizing them into four primary categories: addition-based, partial-based, unified-based, and multi-task tuning. In addition, this paper offers an in-depth analysis of widely used visual datasets and real-world applications where PEFT methods have been successfully applied. Furthermore, this paper introduces the V-PEFT Bench, a unified benchmark designed to standardize the evaluation of PEFT methods across a diverse set of vision tasks, ensuring consistency and fairness in comparison. Finally, the paper outlines potential directions for future research to propel advances in the PEFT field. A comprehensive collection of resources is available at https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning.
comment: Submitted to IEEE TPAMI
♻ ☆ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model
This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
♻ ☆ FireANTs: Adaptive Riemannian Optimization for Multi-Scale Diffeomorphic Matching
The paper proposes FireANTs, a multi-scale Adaptive Riemannian Optimization algorithm for dense diffeomorphic image matching. Existing state-of-the-art methods for diffeomorphic image matching are slow due to inefficient implementations and slow convergence due to the ill-conditioned nature of the optimization problem. Deep learning methods offer fast inference but require extensive training time, substantial inference memory, and fail to generalize across long-tailed distributions or diverse image modalities, necessitating costly retraining. We address these challenges by proposing a training-free, GPU-accelerated multi-scale Adaptive Riemannian Optimization algorithm for fast and accurate dense diffeomorphic image matching. FireANTs runs about 2.5x faster than ANTs on a CPU, and upto 1200x faster on a GPU. On a single GPU, FireANTs performs competitively with deep learning methods on inference runtime while consuming upto 10x less memory. FireANTs shows remarkable robustness to a wide variety of matching problems across modalities, species, and organs without any domain-specific training or tuning. Our framework allows hyperparameter grid search studies with significantly less resources and time compared to traditional and deep learning registration algorithms alike.
♻ ☆ Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance
Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait and temporal consistency, and computational efficiency. As video length increases, issues such as visual degradation, identity inconsistency, temporal incoherence, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal consistency, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.
comment: 10 pages, 7 figures
♻ ☆ Seeing in the Dark: A Teacher-Student Framework for Dark Video Action Recognition via Knowledge Distillation and Contrastive Learning
Action recognition in dark or low-light (under-exposed) videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes ActLumos, a teacher-student framework that attains single-stream inference while retaining multi-stream level accuracy. The teacher consumes dual stream inputs, which include original dark frames and retinex-enhanced frames, processed by weight-shared R(2+1)D-34 backbones and fused by a Dynamic Feature Fusion (DFF) module, which dynamically re-weights the two streams at each time step, emphasising the most informative temporal segments. The teacher is also included with a supervised contrastive loss (SupCon) that sharpens class margins. The student shares the R(2+1)D-34 backbone but uses only dark frames and no fusion at test time. The student is first pre-trained with self-supervision on dark clips of both datasets without their labels and then fine-tuned with knowledge distillation from the teacher, transferring the teacher's multi-stream knowledge into a single-stream model. Under single-stream inference, the distilled student attains state-of-the-art accuracy of 96.92% (Top-1) on ARID V1.0, 88.27% on ARID V1.5, and 48.96% on Dark48. Ablation studies further highlight the individual contributions of each component, i.e., DFF in the teacher outperforms single or static fusion, knowledge distillation (KD) transfers these gains to the single-stream student, and two-view spatio-temporal SSL surpasses spatial-only or temporal-only variants without increasing inference cost. The official website of this work is available at: https://github.com/HrishavBakulBarua/ActLumos
♻ ☆ Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection
Out-of-distribution (OOD) detection is essential for reliably deploying machine learning models in the wild. Yet, most methods treat large pre-trained models as monolithic encoders and rely solely on their final-layer representations for detection. We challenge this wisdom. We reveal the \textit{intermediate layers} of pre-trained models, shaped by residual connections that subtly transform input projections, \textit{can} encode \textit{surprisingly rich and diverse signals} for detecting distributional shifts. Importantly, to exploit latent representation diversity across layers, we introduce an entropy-based criterion to \textit{automatically} identify layers offering the most complementary information in a training-free setting -- \textit{without access to OOD data}. We show that selectively incorporating these intermediate representations can increase the accuracy of OOD detection by up to \textbf{$10\%$} in far-OOD and over \textbf{$7\%$} in near-OOD benchmarks compared to state-of-the-art training-free methods across various model architectures and training objectives. Our findings reveal a new avenue for OOD detection research and uncover the impact of various training objectives and model architectures on confidence-based OOD detection methods.
♻ ☆ Is Artificial Intelligence Generated Image Detection a Solved Problem? NeurIPS 2025
The rapid advancement of generative models, such as GANs and Diffusion models, has enabled the creation of highly realistic synthetic images, raising serious concerns about misinformation, deepfakes, and copyright infringement. Although numerous Artificial Intelligence Generated Image (AIGI) detectors have been proposed, often reporting high accuracy, their effectiveness in real-world scenarios remains questionable. To bridge this gap, we introduce AIGIBench, a comprehensive benchmark designed to rigorously evaluate the robustness and generalization capabilities of state-of-the-art AIGI detectors. AIGIBench simulates real-world challenges through four core tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It includes 23 diverse fake image subsets that span both advanced and widely adopted image generation techniques, along with real-world samples collected from social media and AI art platforms. Extensive experiments on 11 advanced detectors demonstrate that, despite their high reported accuracy in controlled settings, these detectors suffer significant performance drops on real-world data, limited benefits from common augmentations, and nuanced effects of pre-processing, highlighting the need for more robust detection strategies. By providing a unified and realistic evaluation framework, AIGIBench offers valuable insights to guide future research toward dependable and generalizable AIGI detection.Data and code are publicly available at: https://github.com/HorizonTEL/AIGIBench.
comment: Accepted by NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Consistent Story Generation: Unlocking the Potential of Zigzag Sampling
Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates between asymmetric prompting to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to %further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.
comment: 20 pages, 10 figures
♻ ☆ Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion
Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, Transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.
comment: Accepted by IEEE Transactions on Image Processing (TIP)
♻ ☆ GeoCAD: Local Geometry-Controllable CAD Generation with Large Language Models NeurIPS 2025
Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption $\sim$221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at https://github.com/Zhanwei-Z/GeoCAD.
comment: Accepted by NeurIPS 2025
♻ ☆ AutoLungDx: A Hybrid Deep Learning Approach for Early Lung Cancer Diagnosis Using 3D Res-U-Net, YOLOv5, and Vision Transformers
Lung cancer is a leading cause of cancer-related deaths worldwide, and early detection is crucial for improving patient outcomes. Nevertheless, early diagnosis of cancer is a major challenge, particularly in low-resource settings where access to medical resources and trained radiologists is limited. The objective of this study is to propose an automated end-to-end deep learning-based framework for the early detection and classification of lung nodules, specifically for low-resource settings. The proposed framework consists of three stages: lung segmentation using a modified 3D U-Net named 3D Res-U-Net, nodule detection using YOLO-v5, and classification with a Vision Transformer-based architecture. We evaluated the proposed framework on a publicly available dataset, LUNA16. The proposed framework's performance was measured using the respective domain's evaluation matrices. The proposed framework achieved a 98.82% lung segmentation dice score while detecting the lung nodule with 0.76 mAP@50 from the segmented lung, at a low false-positive rate. The performance of both networks of the proposed framework was compared with other studies and found to outperform them regarding segmentation and detection accuracy. Additionally, our proposed Vision transformer network obtained an accuracy of 93.57%, which is 1.21% higher than the state-of-the-art networks. Our proposed end-to-end deep learning-based framework can effectively segment lungs, and detect and classify lung nodules, specifically in low-resource settings with limited access to radiologists. The proposed framework outperforms existing studies regarding all the respective evaluation metrics. The proposed framework can potentially improve the accuracy and efficiency of lung cancer screening in low-resource settings, ultimately leading to better patient outcomes.
♻ ☆ Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
comment: Project Page: https://jiaming-zhou.github.io/AGNOSTOS
♻ ☆ FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis
Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representation of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This represents the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.
♻ ☆ Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models
As pre-trained text-to-image diffusion models have become a useful tool for image synthesis, people want to specify the results in various ways. This paper tackles training-free appearance transfer, which produces an image with the structure of a target image from the appearance of a reference image. Existing methods usually do not reflect semantic correspondence, as they rely on query-key similarity within the self-attention layer to establish correspondences between images. To this end, we propose explicitly rearranging the features according to the dense semantic correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the correct color from the reference, even when the two images are not aligned.
comment: project page : https://sooyeon-go.github.io/eye_for_an_eye/
♻ ☆ Limitations of Data-Driven Spectral Reconstruction -- An Optics-Aware Analysis
Hyperspectral imaging empowers machine vision systems with the distinct capability of identifying materials through recording their spectral signatures. Recent efforts in data-driven spectral reconstruction aim at extracting spectral information from RGB images captured by cost-effective RGB cameras, instead of dedicated hardware. Published work reports exceedingly high numerical scores for this reconstruction task, yet real-world performance lags substantially behind. We systematically analyze the performance of such methods. First, we evaluate the overfitting limitations with respect to current datasets by training the networks with less data, validating the trained models with unseen yet slightly modified data and cross-dataset validation. Second, we reveal fundamental limitations in the ability of RGB to spectral methods to deal with metameric or near-metameric conditions, which have so far gone largely unnoticed due to the insufficiencies of existing datasets. We validate the trained models with metamer data generated by metameric black theory and re-training the networks with various forms of metamers. This methodology can also be used for data augmentation as a partial mitigation of the dataset issues, although the RGB to spectral inverse problem remains fundamentally ill-posed. Finally, we analyze the potential for modifying the problem setting to achieve better performance by exploiting optical encoding provided by either optical aberrations or deliberate optical design. Our experiments show such approaches provide improved results under certain circumstances, but their overall performance is limited by the same dataset issues. We conclude that future progress on snapshot spectral imaging will heavily depend on the generation of improved datasets which can then be used to design effective optical encoding strategies. Code: https://github.com/vccimaging/OpticsAwareHSI-Analysis.
comment: 16 pages, 9 figures, 9 tables
♻ ☆ MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference NeurIPS 2025
Modeling reflections from 2D images is essential for photorealistic rendering and novel view synthesis. Recent approaches enhance Gaussian primitives with reflection-related material attributes to enable physically based rendering (PBR) with Gaussian Splatting. However, the material inference often lacks sufficient constraints, especially under limited environment modeling, resulting in illumination aliasing and reduced generalization. In this work, we revisit the problem from a multi-view perspective and show that multi-view consistent material inference with more physically-based environment modeling is key to learning accurate reflections with Gaussian Splatting. To this end, we enforce 2D Gaussians to produce multi-view consistent material maps during deferred shading. We also track photometric variations across views to identify highly reflective regions, which serve as strong priors for reflection strength terms. To handle indirect illumination caused by inter-object occlusions, we further introduce an environment modeling strategy through ray tracing with 2DGS, enabling photorealistic rendering of indirect radiance. Experiments on widely used benchmarks show that our method faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel views synthesis.
comment: Accepted by NeurIPS 2025. Project Page: https://wen-yuan-zhang.github.io/MaterialRefGS
♻ ☆ ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM
Multimodal large language models (MLLMs) frequently hallucinate by over-committing to spurious visual cues. Prior remedies-Visual and Instruction Contrastive Decoding (VCD, ICD)-mitigate this issue, yet the mechanism remains opaque. We first empirically show that their improvements systematically coincide with redistributions of cross-modal attention. Building on this insight, we propose Attention-Steerable Contrastive Decoding (ASCD), which directly steers the attention scores during decoding. ASCD combines (i) positive steering, which amplifies automatically mined text-centric heads-stable within a model and robust across domains-with (ii) negative steering, which dampens on-the-fly identified critical visual tokens. The method incurs negligible runtime and memory overhead and requires no additional training. Across five MLLM backbones and three decoding schemes, ASCD reduces hallucination on POPE, CHAIR, and MMHal-Bench by up to 38.2 percent while improving accuracy on standard VQA benchmarks, including MMMU, MM-VET, ScienceQA, TextVQA, and GQA. These results position attention steering as a simple, model-agnostic, and principled route to safer, more faithful multimodal generation.
comment: 14 pages, 8 figures
♻ ☆ Multimodal Fusion at Three Tiers: Physics-Driven Data Generation and Vision-Language Guidance for Brain Tumor Segmentation
Accurate brain tumor segmentation is crucial for neuro-oncology diagnosis and treatment planning. Deep learning methods have made significant progress, but automatic segmentation still faces challenges, including tumor morphological heterogeneity and complex three-dimensional spatial relationships. This paper proposes a three-tier fusion architecture that achieves precise brain tumor segmentation. The method processes information progressively at the pixel, feature, and semantic levels. At the pixel level, physical modeling extends magnetic resonance imaging (MRI) to multimodal data, including simulated ultrasound and synthetic computed tomography (CT). At the feature level, the method performs Transformer-based cross-modal feature fusion through multi-teacher collaborative distillation, integrating three expert teachers (MRI, US, CT). At the semantic level, clinical textual knowledge generated by GPT-4V is transformed into spatial guidance signals using CLIP contrastive learning and Feature-wise Linear Modulation (FiLM). These three tiers together form a complete processing chain from data augmentation to feature extraction to semantic guidance. We validated the method on the Brain Tumor Segmentation (BraTS) 2020, 2021, and 2023 datasets. The model achieves average Dice coefficients of 0.8665, 0.9014, and 0.8912 on the three datasets, respectively, and reduces the 95% Hausdorff Distance (HD95) by an average of 6.57 millimeters compared with the baseline. This method provides a new paradigm for precise tumor segmentation and boundary localization.
comment: 31 pages,3 figures
♻ ☆ STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding
Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB \(+\) auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
comment: Code, model, and demos can be found at https://envision-research.github.io/STANCE/
♻ ☆ Agentic Design of Compositional Machines
The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. With this simplification, machine design is expressed as writing XML-like code that explicitly specifies pairwise part connections. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
comment: 75 pages, 31 figures, Project Page: https://besiegefield.github.io
♻ ☆ Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation NeurIPS 2025
In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.
comment: Accepted to NeurIPS 2025. 23 pages, with the appendix
♻ ☆ Cutting-edge 3D reconstruction solutions for underwater coral reef images: A review and comparison
Corals serve as the foundational habitat-building organisms within reef ecosystems, constructing extensive structures that extend over vast distances. However, their inherent fragility and vulnerability to various threats render them susceptible to significant damage and destruction. The application of advanced 3D reconstruction technologies for high-quality modeling is crucial for preserving them. These technologies help scientists to accurately document and monitor the state of coral reefs, including their structure, species distribution and changes over time. Photogrammetry-based approaches stand out among existing solutions, especially with recent advancements in underwater videography, photogrammetric computer vision, and machine learning. Despite continuous progress in image-based 3D reconstruction techniques, there remains a lack of systematic reviews and comprehensive evaluations of cutting-edge solutions specifically applied to underwater coral reef images. The emerging advanced methods may have difficulty coping with underwater imaging environments, complex coral structures, and computational resource constraints. They need to be reviewed and evaluated to bridge the gap between many cutting-edge technical studies and practical applications. This paper focuses on the two critical stages of these approaches: camera pose estimation and dense surface reconstruction. We systematically review and summarize classical and emerging methods, conducting comprehensive evaluations through real-world and simulated datasets. Based on our findings, we offer reference recommendations and discuss the development potential and challenges of existing approaches in depth. This work equips scientists and managers with a technical foundation and practical guidance for processing underwater coral reef images for 3D reconstruction....
♻ ☆ Dolphin v1.0 Technical Report
Ultrasound is crucial in modern medicine but faces challenges like operator dependence, image noise, and real-time scanning, hindering AI integration. While large multimodal models excel in other medical imaging areas, they struggle with ultrasound's complexities. To address this, we introduce Dolphin v1.0 (V1) and its reasoning-augmented version, Dolphin R1-the first large-scale multimodal ultrasound foundation models unifying diverse clinical tasks in a single vision-language framework.To tackle ultrasound variability and noise, we curated a 2-million-scale multimodal dataset, combining textbook knowledge, public data, synthetic samples, and general corpora. This ensures robust perception, generalization, and clinical adaptability.The Dolphin series employs a three-stage training strategy: domain-specialized pretraining, instruction-driven alignment, and reinforcement-based refinement. Dolphin v1.0 delivers reliable performance in classification, detection, regression, and report generation. Dolphin R1 enhances diagnostic inference, reasoning transparency, and interpretability through reinforcement learning with ultrasound-specific rewards.Evaluated on U2-Bench across eight ultrasound tasks, Dolphin R1 achieves a U2-score of 0.5835-over twice the second-best model (0.2968) setting a new state of the art. Dolphin v1.0 also performs competitively, validating the unified framework. Comparisons show reasoning-enhanced training significantly improves diagnostic accuracy, consistency, and interpretability, highlighting its importance for high-stakes medical AI.
♻ ☆ Free$^2$Guide: Training-Free Text-to-Video Alignment using Image LVLM ICCV 2025
Diffusion models have achieved impressive results in generative tasks for text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions trained for videos, hindering their scalability and applicability. In this paper, we propose \textbf{Free$^2$Guide}, a novel gradient-free and training-free framework for aligning generated videos with text prompts. Specifically, leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. To enable image-trained LVLMs to assess text-to-video alignment, we leverage \textit{stitching} between video frames and use system prompts to capture sequential attributions. Our framework supports the flexible ensembling of multiple reward models to synergistically enhance alignment without significant computational overhead. Experimental results confirm that Free$^2$Guide using image-trained LVLMs significantly improves text-to-video alignment, thereby enhancing the overall video quality. Our results and code are available at https://kjm981995.github.io/free2guide/
comment: ICCV 2025 accepted
♻ ☆ GMatch: A Lightweight, Geometry-Constrained Keypoint Matcher for Zero-Shot 6DoF Pose Estimation in Robotic Grasp Tasks
6DoF object pose estimation is fundamental to robotic grasp tasks. While recent learning-based methods achieve high accuracy, their computational demands hinder deployment on resource-constrained mobile platforms. In this work, we revisit the classical keypoint matching paradigm and propose GMatch, a lightweight, geometry-constrained keypoint matcher that can run efficiently on embedded CPU-only platforms. GMatch works with keypoint descriptors and it uses a set of geometric constraints to establishes inherent ambiguities between features extracted by descriptors, thus giving a globally consistent correspondences from which 6DoF pose can be easily solved. We benchmark GMatch on the HOPE and YCB-Video datasets, where our method beats existing keypoint matchers (both feature-based and geometry-based) among three commonly used descriptors and approaches the SOTA zero-shot method on texture-rich objects with much more humble devices. The method is further deployed on a LoCoBot mobile manipulator, enabling a one-shot grasp pipeline that demonstrates high task success rates in real-world experiments. In a word, by its lightweight and white-box nature, GMatch offers a practical solution for resource-limited robotic systems, and although currently bottlenecked by descriptor quality, the framework presents a promising direction towards robust yet efficient pose estimation. Code will be released soon under Mozilla Public License.
comment: 6 pages + 1 page references + 1 page appendix; 6 figures; 3 tables
Machine Learning 72
☆ Video Reasoning without Training
Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
☆ Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability
Latent component identification from unknown nonlinear mixtures is a foundational challenge in machine learning, with applications in tasks such as disentangled representation learning and causal inference. Prior work in nonlinear independent component analysis (nICA) has shown that auxiliary signals -- such as weak supervision -- can support identifiability of conditionally independent latent components. More recent approaches explore structural assumptions, e.g., sparsity in the Jacobian of the mixing function, to relax such requirements. In this work, we introduce Diverse Influence Component Analysis (DICA), a framework that exploits the convex geometry of the mixing function's Jacobian. We propose a Jacobian Volume Maximization (J-VolMax) criterion, which enables latent component identification by encouraging diversity in their influence on the observed variables. Under reasonable conditions, this approach achieves identifiability without relying on auxiliary information, latent component independence, or Jacobian sparsity assumptions. These results extend the scope of identifiability analysis and offer a complementary perspective to existing methods.
comment: 30 pages, 3 figures
☆ Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation NeurIPS 2025
We study the Quality of Service Degradation (QoSD) problem, in which an adversary perturbs edge weights to degrade network performance. This setting arises in both network infrastructures and distributed ML systems, where communication quality, not just connectivity, determines functionality. While classical methods rely on combinatorial optimization, and recent ML approaches address only restricted linear variants with small-size networks, no prior model directly tackles the QoSD problem under nonlinear edge-weight functions. This work proposes \PIMMA, a self-reinforcing generative framework that synthesizes feasible solutions in latent space, to fill this gap. Our method includes three phases: (1) Forge: a Predictive Path-Stressing (PPS) algorithm that uses graph learning and approximation to produce feasible solutions with performance guarantee, (2) Morph: a new theoretically grounded training paradigm for Mixture of Conditional VAEs guided by an energy-based model to capture solution feature distributions, and (3) Refine: a reinforcement learning agent that explores this space to generate progressively near-optimal solutions using our designed differentiable reward function. Experiments on both synthetic and real-world networks show that our approach consistently outperforms classical and ML baselines, particularly in scenarios with nonlinear cost functions where traditional methods fail to generalize.
comment: 62 pages, 19 figures, Neural Information Processing Systems (NeurIPS 2025)
☆ Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models
An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic ``concept space'' with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
☆ Curiosity-driven RL for symbolic equation solving NeurIPS 2025
We explore if RL can be useful for symbolic mathematics. Previous work showed contrastive learning can solve linear equations in one variable. We show model-free PPO \cite{schulman2017proximal} augmented with curiosity-based exploration and graph-based actions can solve nonlinear equations such as those involving radicals, exponentials, and trig functions. Our work suggests curiosity-based exploration may be useful for general symbolic reasoning tasks.
comment: Accepted at the NeurIPS 2025 MATH-AI Workshop
☆ Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning
Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent. Code is available at https://github.com/OPTML-Group/Unlearn-Backdoor.
☆ Extended LSTM: Adaptive Feature Gating for Toxic Comment Classification
Toxic comment detection remains a challenging task, where transformer-based models (e.g., BERT) incur high computational costs and degrade on minority toxicity classes, while classical ensembles lack semantic adaptability. We propose xLSTM, a parameter-efficient and theoretically grounded framework that unifies cosine-similarity gating, adaptive feature prioritization, and principled class rebalancing. A learnable reference vector {v} in {R}^d modulates contextual embeddings via cosine similarity, amplifying toxic cues and attenuating benign signals to yield stronger gradients under severe class imbalance. xLSTM integrates multi-source embeddings (GloVe, FastText, BERT CLS) through a projection layer, a character-level BiLSTM for morphological cues, embedding-space SMOTE for minority augmentation, and adaptive focal loss with dynamic class weighting. On the Jigsaw Toxic Comment benchmark, xLSTM attains 96.0% accuracy and 0.88 macro-F1, outperforming BERT by 33% on threat and 28% on identity_hate categories, with 15 times fewer parameters and 50ms inference latency. Cosine gating contributes a +4.8% F1 gain in ablations. The results establish a new efficiency adaptability frontier, demonstrating that lightweight, theoretically informed architectures can surpass large pretrained models on imbalanced, domain-specific NLP tasks.
☆ Justitia: Fair and Efficient Scheduling for LLM Applications
In the era of Large Language Models (LLMs), it has been popular to launch a series of LLM inferences -- we call an LLM application -- to better solve real-world problems. When serving those applications in shared GPU servers, the schedulers are expected to attain fast application completions with guaranteed worst-case performance. However, mainstream LLM schedulers fail to behave well for LLM applications -- due to head-of-line blocking or over-constrained resource allocation. In this paper, we propose to serve LLM applications in a fair and also efficient manner. To this end, we design Justitia, a novel scheduler with three key techniques. First, given that memory is prevalently a bottleneck for mainstream inference frameworks like vLLM, Justitia models the service cost of LLM applications in a memory-centric manner. Meanwhile, it uses a simple neural network model to conduct light-weight and also accurate demand prediction. Moreover, Justitia adopts a virtual-time based fair queuing algorithm to reduce the overall performance with guaranteed worst-case delay. We have implemented Justitia atop vLLM, and experimental results involving diverse LLM applications show that it can substantially enhance the scheduling efficiency with fairness preserved.
☆ EEschematic: Multimodal-LLM Based AI Agent for Schematic Generation of Analog Circuit
Circuit schematics play a crucial role in analog integrated circuit design, serving as the primary medium for human understanding and verification of circuit functionality. While recent large language model (LLM)-based approaches have shown promise in circuit topology generation and device sizing, most rely solely on textual representations such as SPICE netlists, which lack visual interpretability for circuit designers. To address this limitation, we propose EEschematic, an AI agent for automatic analog schematic generation based on a Multimodal Large Language Model (MLLM). EEschematic integrates textual, visual, and symbolic modalities to translate SPICE netlists into schematic diagrams represented in a human-editable format. The framework uses six analog substructure examples for few-shot placement and a Visual Chain-of-Thought (VCoT) strategy to iteratively refine placement and wiring, enhancing schematic clarity and symmetry. Experimental results on representative analog circuits, including a CMOS inverter, a five-transistor operational transconductance amplifier (5T-OTA), and a telescopic cascode amplifier, demonstrate that EEschematic produces schematics with high visual quality and structural correctness.
☆ Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs NeurIPS 2025
Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an \emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.
comment: NeurIPS 2025 (spotlight)
☆ Graph4MM: Weaving Multimodal Learning with Structural Information ICML 2025
Real-world multimodal data usually exhibit complex structural relationships beyond traditional one-to-one mappings like image-caption pairs. Entities across modalities interact in intricate ways, with images and text forming diverse interconnections through contextual dependencies and co-references. Graphs provide powerful structural information for modeling intra-modal and inter-modal relationships. However, previous works fail to distinguish multi-hop neighbors and treat the graph as a standalone modality, which fragments the overall understanding. This limitation presents two key challenges in multimodal learning: (1) integrating structural information from multi-hop neighbors into foundational models, and (2) fusing modality-specific information in a principled manner. To address these challenges, we revisit the role of graphs in multimodal learning within the era of foundation models and propose Graph4MM, a graph-based multimodal learning framework. To be specific, we introduce Hop-Diffused Attention, which integrates multi-hop structural information into self-attention through causal masking and hop diffusion. Furthermore, we design MM-QFormer, a multi-mapping querying transformer for cross-modal fusion. Through theoretical and empirical analysis, we show that leveraging structures to integrate both intra- and inter-modal interactions improves multimodal understanding beyond treating them as a standalone modality. Experiments on both generative and discriminative tasks show that Graph4MM outperforms larger VLMs, LLMs, and multimodal graph baselines, achieving a 6.93% average improvement.
comment: ICML 2025
☆ Adaptive Sample Sharing for Linear Regression
In many business settings, task-specific labeled data are scarce or costly to obtain, which limits supervised learning on a specific task. To address this challenge, we study sample sharing in the case of ridge regression: leveraging an auxiliary data set while explicitly protecting against negative transfer. We introduce a principled, data-driven rule that decides how many samples from an auxiliary dataset to add to the target training set. The rule is based on an estimate of the transfer gain i.e. the marginal reduction in the predictive error. Building on this estimator, we derive finite-sample guaranties: under standard conditions, the procedure borrows when it improves parameter estimation and abstains otherwise. In the Gaussian feature setting, we analyze which data set properties ensure that borrowing samples reduces the predictive error. We validate the approach in synthetic and real datasets, observing consistent gains over strong baselines and single-task training while avoiding negative transfer.
Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection
Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.
comment: Accepted to IEEE COMPAS 2025. 6 pages, 3 figures, 6 tables
☆ One-step Diffusion Models with Bregman Density Ratio Matching
Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.
comment: work in progress
☆ MuonBP: Faster Muon via Block-Periodic Orthogonalization
Gradient orthogonalization is a simple strategy that shows great utility in speeding up gradient descent. The Muon optimizer (Jordan, Jin, et al., 2024) combines gradient orthogonalization with first-order momentum and achieves significant improvement in data efficiency over Adam/AdamW (Loshchilov and Hutter, 2019) for language model training. However, when using model parallelism, gradient orthogonalization introduces additional overhead compared to coordinate-wise optimizers (such as AdamW) due to additional gather and scatter operations on gradient matrix shards from different devices. This additional communication can amount to a throughput hit of 5%-10% compared to Adam/AdamW. To remedy this, we propose Muon with Block-Periodic Orthogonalization (MuonBP), which applies orthogonalization independently to matrix shards on each device and periodically performs full orthogonalization to maintain training stability at scale. We show how to adjust the learning rate from the baseline to MuonBP and give convergence guarantees for this algorithm. Crucially, our theory dictates that we use two stepsizes: one for the blockwise orthogonalization steps, and one for the full orthogonalization steps. Our method is simple, requires minimal hyperparameter adjustments, and achieves competitive iteration complexity compared with baseline Muon while providing per-iteration throughput comparable to coordinate-wise methods such as AdamW. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer state sharding, MuonBP achieves 8% throughput increase compared to Muon with no degradation in performance.
☆ Towards Interpretable and Trustworthy Time Series Reasoning: A BlueSky Vision
Time series reasoning is emerging as the next frontier in temporal analysis, aiming to move beyond pattern recognition towards explicit, interpretable, and trustworthy inference. This paper presents a BlueSky vision built on two complementary directions. One builds robust foundations for time series reasoning, centered on comprehensive temporal understanding, structured multi-step reasoning, and faithful evaluation frameworks. The other advances system-level reasoning, moving beyond language-only explanations by incorporating multi-agent collaboration, multi-modal context, and retrieval-augmented approaches. Together, these directions outline a flexible and extensible framework for advancing time series reasoning, aiming to deliver interpretable and trustworthy temporal intelligence across diverse domains.
☆ Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees
In social sciences, small- to medium-scale datasets are common and linear regression (LR) is canonical. In privacy-aware settings, much work has focused on differentially private (DP) LR, but mostly on point estimation with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP LR methods do not readily support it. Mainstream SDG approaches are either tailored to discretized data, making them less suitable for continuous regression, or rely on deep models that require large datasets, limiting their use for the smaller, continuous data typical in social science. We propose a method for LR with valid inference under Gaussian DP: a DP bias-corrected estimator with asymptotic confidence intervals (CIs) and a general SDG procedure in which regression on the synthetic data matches our DP regression. Our binning-aggregation strategy is effective in small- to moderate-dimensional settings. Experiments show our method (1) improves accuracy over existing methods, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream ML tasks than current DP SDGs.
☆ Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures
Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs.
comment: Code is at https://github.com/unites-lab/shadow-moe
Quantile Regression, Variational Autoencoders, and Diffusion Models for Uncertainty Quantification: A Spatial Analysis of Sub-seasonal Wind Speed Prediction
This study aims to improve the spatial representation of uncertainties when regressing surface wind speeds from large-scale atmospheric predictors for sub-seasonal forecasting. Sub-seasonal forecasting often relies on large-scale atmospheric predictors such as 500 hPa geopotential height (Z500), which exhibit higher predictability than surface variables and can be downscaled to obtain more localised information. Previous work by Tian et al. (2024) demonstrated that stochastic perturbations based on model residuals can improve ensemble dispersion representation in statistical downscaling frameworks, but this method fails to represent spatial correlations and physical consistency adequately. More sophisticated approaches are needed to capture the complex relationships between large-scale predictors and local-scale predictands while maintaining physical consistency. Probabilistic deep learning models offer promising solutions for capturing complex spatial dependencies. This study evaluates three probabilistic methods with distinct uncertainty quantification mechanisms: Quantile Regression Neural Network that directly models distribution quantiles, Variational Autoencoders that leverage latent space sampling, and Diffusion Models that utilise iterative denoising. These models are trained on ERA5 reanalysis data and applied to ECMWF sub-seasonal hindcasts to regress probabilistic wind speed ensembles. Our results show that probabilistic downscaling approaches provide more realistic spatial uncertainty representations compared to simpler stochastic methods, with each probabilistic model offering different strengths in terms of ensemble dispersion, deterministic skill, and physical consistency. These findings establish probabilistic downscaling as an effective enhancement to operational sub-seasonal wind forecasts for renewable energy planning and risk assessment.
comment: This Work has been submitted to Monthly Weather Review. Copyright in this Work may be transferred without further notice
☆ Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation
Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics like solution accuracy or runtime, which obscure structural or numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective. Analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability. Constraint precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) Complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.
☆ A Primer on Kolmogorov-Arnold Networks (KANs) for Probabilistic Time Series Forecasting
This work introduces Probabilistic Kolmogorov-Arnold Network (P-KAN), a novel probabilistic extension of Kolmogorov-Arnold Networks (KANs) for time series forecasting. By replacing scalar weights with spline-based functional connections and directly parameterizing predictive distributions, P-KANs offer expressive yet parameter-efficient models capable of capturing nonlinear and heavy-tailed dynamics. We evaluate P-KANs on satellite traffic forecasting, where uncertainty-aware predictions enable dynamic thresholding for resource allocation. Results show that P-KANs consistently outperform Multi Layer Perceptron (MLP) baselines in both accuracy and calibration, achieving superior efficiency-risk trade-offs while using significantly fewer parameters. We build up P-KANs on two distributions, namely Gaussian and Student-t distributions. The Gaussian variant provides robust, conservative forecasts suitable for safety-critical scenarios, whereas the Student-t variant yields sharper distributions that improve efficiency under stable demand. These findings establish P-KANs as a powerful framework for probabilistic forecasting with direct applicability to satellite communications and other resource-constrained domains.
☆ A Topological Approach to Parameterizing Deep Hedging Networks
Deep hedging uses recurrent neural networks to hedge financial products that cannot be fully hedged in incomplete markets. Previous work in this area focuses on minimizing some measure of quadratic hedging error by calculating pathwise gradients, but doing so requires large batch sizes and can make training effective models in a reasonable amount of time challenging. We show that by adding certain topological features, we can reduce batch sizes substantially and make training these models more practically feasible without greatly compromising hedging performance.
☆ Prediction-Augmented Trees for Reliable Statistical Inference
The remarkable success of machine learning (ML) in predictive tasks has led scientists to incorporate ML predictions as a core component of the scientific discovery pipeline. This was exemplified by the landmark achievement of AlphaFold (Jumper et al. (2021)). In this paper, we study how ML predictions can be safely used in statistical analysis of data towards scientific discovery. In particular, we follow the framework introduced by Angelopoulos et al. (2023). In this framework, we assume access to a small set of $n$ gold-standard labeled samples, a much larger set of $N$ unlabeled samples, and a ML model that can be used to impute the labels of the unlabeled data points. We introduce two new learning-augmented estimators: (1) Prediction-Augmented Residual Tree (PART), and (2) Prediction-Augmented Quadrature (PAQ). Both estimators have significant advantages over existing estimators like PPI and PPI++ introduced by Angelopoulos et al. (2023) and Angelopoulos et al. (2024), respectively. PART is a decision-tree based estimator built using a greedy criterion. We first characterize PART's asymptotic distribution and demonstrate how to construct valid confidence intervals. Then we show that PART outperforms existing methods in real-world datasets from ecology, astronomy, and census reports, among other domains. This leads to estimators with higher confidence, which is the result of using both the gold-standard samples and the machine learning predictions. Finally, we provide a formal proof of the advantage of PART by exploring PAQ, an estimation that arises when considering the limit of PART when the depth its tree grows to infinity. Under appropriate assumptions in the input data we show that the variance of PAQ shrinks at rate of $O(N^{-1} + n^{-4})$, improving significantly on the $O(N^{-1}+n^{-1})$ rate of existing methods.
comment: 45 pages, 9 Figures
☆ Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws ICLR 2026
The lack of theoretical results for Layer Normalization and feedforward Hessians has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior self-attention analyses and yield estimations for the role of each sublayer in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a Taylor-expansion-based framework for analyzing loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
comment: 38 pages, 12 figures. Submitted to ICLR 2026
☆ UNDREAM: Bridging Differentiable Rendering and Photorealistic Simulation for End-to-end Adversarial Attacks
Deep learning models deployed in safety critical applications like autonomous driving use simulations to test their robustness against adversarial attacks in realistic conditions. However, these simulations are non-differentiable, forcing researchers to create attacks that do not integrate simulation environmental factors, reducing attack success. To address this limitation, we introduce UNDREAM, the first software framework that bridges the gap between photorealistic simulators and differentiable renderers to enable end-to-end optimization of adversarial perturbations on any 3D objects. UNDREAM enables manipulation of the environment by offering complete control over weather, lighting, backgrounds, camera angles, trajectories, and realistic human and object movements, thereby allowing the creation of diverse scenes. We showcase a wide array of distinct physically plausible adversarial objects that UNDREAM enables researchers to swiftly explore in different configurable environments. This combination of photorealistic simulation and differentiable optimization opens new avenues for advancing research of physical adversarial attacks.
☆ SolverLLM: Leveraging Test-Time Scaling for Optimization Problem via LLM-Guided Search NeurIPS 2025
Large Language Models (LLMs) offer promising capabilities for tackling complex reasoning tasks, including optimization problems. However, existing methods either rely on prompt engineering, which leads to poor generalization across problem types, or require costly supervised training. We introduce SolverLLM, a training-free framework that leverages test-time scaling to solve diverse optimization problems. Rather than solving directly, SolverLLM generates mathematical formulations and translates them into solver-ready code, guided by a novel Monte Carlo Tree Search (MCTS) strategy. To enhance the search process, we modify classical MCTS with (1) dynamic expansion for adaptive formulation generation, (2) prompt backpropagation to guide exploration via outcome-driven feedback, and (3) uncertainty backpropagation to incorporate reward reliability into decision-making. Experiments on six standard benchmark datasets demonstrate that SolverLLM outperforms both prompt-based and learning-based baselines, achieving strong generalization without additional training.
comment: NeurIPS 2025
☆ Domain Generalizable Continual Learning
To adapt effectively to dynamic real-world environments, intelligent systems must continually acquire new skills while generalizing them to diverse, unseen scenarios. Here, we introduce a novel and realistic setting named domain generalizable continual learning (DGCL): a model learns sequential tasks with each involving a single domain, aiming to perform well across all encountered tasks and domains. This setting poses unique challenges in acquiring, retaining, and leveraging both semantic- and domain-relevant information for robust generalization. Although state-of-the-art continual learning (CL) methods have employed pre-trained models (PTMs) to enhance task-specific generalization, they typically assume identical training and testing domains for each task and therefore perform poorly in DGCL. To this end, we propose adaptive Domain Transformation (DoT), an innovative PTMs-based approach tailored to DGCL. Inspired by the distributed-plus-hub theory of the human brain, DoT disentangles semantic- and domain-relevant information in representation learning, and adaptively transforms task representations across various domains for output alignment, ensuring balanced and generalized predictions. DoT serves as a plug-in strategy that greatly facilitates state-of-the-art CL baselines under both full parameter tuning and parameter-efficient tuning paradigms in DGCL, validated by extensive experiments. Also, DoT is shown to accumulate domain-generalizable knowledge from DGCL, and ensure resource efficiency with a lightweight implementation.
comment: 25 pages
☆ A Lightweight DL Model for Smart Grid Power Forecasting with Feature and Resolution Mismatch
How can short-term energy consumption be accurately forecasted when sensor data is noisy, incomplete, and lacks contextual richness? This question guided our participation in the \textit{2025 Competition on Electric Energy Consumption Forecast Adopting Multi-criteria Performance Metrics}, which challenged teams to predict next-day power demand using real-world high-frequency data. We proposed a robust yet lightweight Deep Learning (DL) pipeline combining hourly downsizing, dual-mode imputation (mean and polynomial regression), and comprehensive normalization, ultimately selecting Standard Scaling for optimal balance. The lightweight GRU-LSTM sequence-to-one model achieves an average RMSE of 601.9~W, MAE of 468.9~W, and 84.36\% accuracy. Despite asymmetric inputs and imputed gaps, it generalized well, captured nonlinear demand patterns, and maintained low inference latency. Notably, spatiotemporal heatmap analysis reveals a strong alignment between temperature trends and predicted consumption, further reinforcing the model's reliability. These results demonstrate that targeted preprocessing paired with compact recurrent architectures can still enable fast, accurate, and deployment-ready energy forecasting in real-world conditions.
comment: 5 pages, 3 figures, The IEEE PES ISGT Middle East 2025 (ISGT-ME 2025) November 23-26th 2025, Dubai, UAE
☆ SNOMED CT-powered Knowledge Graphs for Structured Clinical Data and Diagnostic Reasoning
The effectiveness of artificial intelligence (AI) in healthcare is significantly hindered by unstructured clinical documentation, which results in noisy, inconsistent, and logically fragmented training data. To address this challenge, we present a knowledge-driven framework that integrates the standardized clinical terminology SNOMED CT with the Neo4j graph database to construct a structured medical knowledge graph. In this graph, clinical entities such as diseases, symptoms, and medications are represented as nodes, and semantic relationships such as ``caused by,'' ``treats,'' and ``belongs to'' are modeled as edges in Neo4j, with types mapped from formal SNOMED CT relationship concepts (e.g., \texttt{Causative agent}, \texttt{Indicated for}). This design enables multi-hop reasoning and ensures terminological consistency. By extracting and standardizing entity-relationship pairs from clinical texts, we generate structured, JSON-formatted datasets that embed explicit diagnostic pathways. These datasets are used to fine-tune large language models (LLMs), significantly improving the clinical logic consistency of their outputs. Experimental results demonstrate that our knowledge-guided approach enhances the validity and interpretability of AI-generated diagnostic reasoning, providing a scalable solution for building reliable AI-assisted clinical systems.
☆ Adaptive Online Learning with LSTM Networks for Energy Price Prediction
Accurate prediction of electricity prices is crucial for stakeholders in the energy market, particularly for grid operators, energy producers, and consumers. This study focuses on developing a predictive model leveraging Long Short-Term Memory (LSTM) networks to forecast day-ahead electricity prices in the California energy market. The model incorporates a variety of features, including historical price data, weather conditions, and the energy generation mix. A novel custom loss function that integrates Mean Absolute Error (MAE), Jensen-Shannon Divergence (JSD), and a smoothness penalty is introduced to enhance the prediction accuracy and interpretability. Additionally, an online learning approach is implemented to allow the model to adapt to new data incrementally, ensuring continuous relevance and accuracy. The results demonstrate that the custom loss function can improve the model's performance, aligning predicted prices more closely with actual values, particularly during peak intervals. Also, the online learning model outperforms other models by effectively incorporating real-time data, resulting in lower prediction error and variability. The inclusion of the energy generation mix further enhances the model's predictive capabilities, highlighting the importance of comprehensive feature integration. This research provides a robust framework for electricity price forecasting, offering valuable insights and tools for better decision-making in dynamic electricity markets.
☆ DeepChem Equivariant: SE(3)-Equivariant Support in an Open-Source Molecular Machine Learning Library
Neural networks that incorporate geometric relationships respecting SE(3) group transformations (e.g. rotations and translations) are increasingly important in molecular applications, such as molecular property prediction, protein structure modeling, and materials design. These models, known as SE(3)-equivariant neural networks, ensure outputs transform predictably with input coordinate changes by explicitly encoding spatial atomic positions. Although libraries such as E3NN [4] and SE(3)-TRANSFORMER [3 ] offer powerful implementations, they often require substantial deep learning or mathematical prior knowledge and lack complete training pipelines. We extend DEEPCHEM [ 13] with support for ready-to-use equivariant models, enabling scientists with minimal deep learning background to build, train, and evaluate models, such as SE(3)-Transformer and Tensor Field Networks. Our implementation includes equivariant models, complete training pipelines, and a toolkit of equivariant utilities, supported with comprehensive tests and documentation, to facilitate both application and further development of SE(3)-equivariant models.
comment: Presented at Machine Learning Symposium - BayLearn (2025)
☆ UniGTE: Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains
Generalizing to unseen graph tasks without task-specific supervision is challenging: conventional graph neural networks are typically tied to a fixed label space, while large language models (LLMs) struggle to capture graph structure. We introduce UniGTE, an instruction-tuned encoder-decoder framework that unifies structural and semantic reasoning. The encoder augments a pretrained autoregressive LLM with learnable alignment tokens and a structure-aware graph-text attention mechanism, enabling it to attend jointly to a tokenized graph and a natural-language task prompt while remaining permutation-invariant to node order. This yields compact, task-aware graph representations. Conditioned solely on these representations, a frozen LLM decoder predicts and reconstructs: it outputs the task answer and simultaneously paraphrases the input graph in natural language. The reconstruction objective regularizes the encoder to preserve structural cues. UniGTE is instruction-tuned on five datasets spanning node-level, edge-level, and graph-level tasks across diverse domains, yet requires no fine-tuning at inference. It achieves new state-of-the-art zero-shot results on node classification, link prediction, graph classification, and graph regression under cross-task and cross-domain settings, demonstrating that tight integration of graph structure with LLM semantics enables robust, transferable graph reasoning.
☆ Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops \textbf{UDS (Utility-Diversity Sampling)}, a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.
☆ Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning
Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL's effectiveness in addressing this challenge through a biologically inspired design. Code is available at https://github.com/gfyddha/Fly-CL.
☆ DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization
Vehicle aerodynamics optimization has become critical for automotive electrification, where drag reduction directly determines electric vehicle range and energy efficiency. Traditional approaches face an intractable trade-off: computationally expensive Computational Fluid Dynamics (CFD) simulations requiring weeks per design iteration, or simplified models that sacrifice production-grade accuracy. While machine learning offers transformative potential, existing datasets exhibit fundamental limitations -- inadequate mesh resolution, missing vehicle components, and validation errors exceeding 5% -- preventing deployment in industrial workflows. We present DrivAerStar, comprising 12,000 industrial-grade automotive CFD simulations generated using $\text{STAR-CCM+}^\unicode{xAE}$ software. The dataset systematically explores three vehicle configurations through 20 Computer Aided Design (CAD) parameters via Free Form Deformation (FFD) algorithms, including complete engine compartments and cooling systems with realistic internal airflow. DrivAerStar achieves wind tunnel validation accuracy below 1.04% -- a five-fold improvement over existing datasets -- through refined mesh strategies with strict wall $y^+$ control. Benchmarks demonstrate that models trained on this data achieve production-ready accuracy while reducing computational costs from weeks to minutes. This represents the first dataset bridging academic machine learning research and industrial CFD practice, establishing a new standard for data-driven aerodynamic optimization in automotive development. Beyond automotive applications, DrivAerStar demonstrates a paradigm for integrating high-fidelity physics simulations with Artificial Intelligence (AI) across engineering disciplines where computational constraints currently limit innovation.
☆ Schrödinger Bridge Mamba for One-Step Speech Enhancement
We propose Schr\"odinger Bridge Mamba (SBM), a new concept of training-inference framework motivated by the inherent compatibility between Schr\"odinger Bridge (SB) training paradigm and selective state-space model Mamba. We exemplify the concept of SBM with an implementation for generative speech enhancement. Experiments on a joint denoising and dereverberation task using four benchmark datasets demonstrate that SBM, with only 1-step inference, outperforms strong baselines with 1-step or iterative inference and achieves the best real-time factor (RTF). Beyond speech enhancement, we discuss the integration of SB paradigm and selective state-space model architecture based on their underlying alignment, which indicates a promising direction for exploring new deep generative models potentially applicable to a broad range of generative tasks. Demo page: https://sbmse.github.io
comment: 5 pages, 1 figure
♻ ☆ Enhancing the Cross-Size Generalization for Solving Vehicle Routing Problems via Continual Learning
Exploring machine learning techniques for addressing vehicle routing problems has attracted considerable research attention. To achieve decent and efficient solutions, existing deep models for vehicle routing problems are typically trained and evaluated using instances of a single size. This substantially limits their ability to generalize across different problem sizes and thus hampers their practical applicability. To address the issue, we propose a continual learning based framework that sequentially trains a deep model with instances of ascending problem sizes. Specifically, on the one hand, we design an inter-task regularization scheme to retain the knowledge acquired from smaller problem sizes in the model training on a larger size. On the other hand, we introduce an intra-task regularization scheme to consolidate the model by imitating the latest desirable behaviors during training on each size. Additionally, we exploit the experience replay to revisit instances of formerly trained sizes for mitigating the catastrophic forgetting. Experimental results show that our approach achieves predominantly superior performance across various problem sizes (either seen or unseen in the training), as compared to state-of-the-art deep models including the ones specialized for generalizability enhancement. Meanwhile, the ablation studies on the key designs manifest their synergistic effect in the proposed framework.
♻ ☆ Barron Space Representations for Elliptic PDEs with Homogeneous Boundary Conditions
We study the approximation complexity of high-dimensional second-order elliptic PDEs with homogeneous boundary conditions on the unit hypercube, within the framework of Barron spaces. Under the assumption that the coefficients belong to suitably defined Barron spaces, we prove that the solution can be efficiently approximated by two-layer neural networks, circumventing the curse of dimensionality. Our results demonstrate the expressive power of shallow networks in capturing high-dimensional PDE solutions under appropriate structural assumptions.
♻ ☆ Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs
Given an intractable distribution $p$, the problem of variational inference (VI) is to find the best approximation from some more tractable family $Q$. Commonly, one chooses $Q$ to be a family of factorized distributions (i.e., the mean-field assumption), even though $p$ itself does not factorize. We show that this mismatch can lead to an impossibility theorem: if $p$ does not factorize and furthermore has a non-diagonal covariance matrix, then any factorized approximation $q\!\in\!Q$ can correctly estimate at most one of the following three measures of uncertainty: (i) the marginal variances, (ii) the marginal precisions, or (iii) the generalized variance (which for elliptical distributions is closely related to the entropy). In practice, the best variational approximation in $Q$ is found by minimizing some divergence $D(q,p)$ between distributions, and so we ask: how does the choice of divergence determine which measure of uncertainty, if any, is correctly estimated by VI? We consider the classic Kullback-Leibler divergences, the more general $\alpha$-divergences, and a score-based divergence which compares $\nabla \log p$ and $\nabla \log q$. We thoroughly analyze the case where $p$ is a Gaussian and $q$ is a (factorized) Gaussian. In this setting, we show that all the considered divergences can be ordered based on the estimates of uncertainty they yield as objective functions for VI. Finally, we empirically evaluate the validity of this ordering when the target distribution $p$ is not Gaussian.
♻ ☆ Critically-Damped Higher-Order Langevin Dynamics for Generative Modeling
Denoising diffusion probabilistic models (DDPMs) represent an entirely new class of generative AI methods that have yet to be fully explored. They use Langevin dynamics, represented as stochastic differential equations, to describe a process that transforms data into noise, the forward process, and a process that transforms noise into generated data, the reverse process. Many of these methods utilize auxiliary variables that formulate the data as a ``position" variable, and the auxiliary variables are referred to as ``velocity", ``acceleration", etc. In this sense, it is possible to ``critically damp" the dynamics. Critical damping has been successfully introduced in Critically-Damped Langevin Dynamics (CLD) and Critically-Damped Third-Order Langevin Dynamics (TOLD++), but has not yet been applied to dynamics of arbitrary order. The proposed methodology generalizes Higher-Order Langevin Dynamics (HOLD), a recent state-of-the-art diffusion method, by introducing the concept of critical damping from systems analysis. Similarly to TOLD++, this work proposes an optimal set of hyperparameters in the $n$-dimensional case, where HOLD leaves these to be user defined. Additionally, this work provides closed-form solutions for the mean and covariance of the forward process that greatly simplify its implementation. Experiments are performed on the CIFAR-10 and CelebA-HQ $256 \times 256$ datasets, and validated against the FID metric.
comment: 10 pages
♻ ☆ ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas--the most diverse benchmark of fine-grained actions across multiple sports--where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
comment: Accepted to TMLR 2025 - Project page: https://amir-aghdam.github.io/act-align/
♻ ☆ Subgradient Method for System Identification with Non-Smooth Objectives
This paper investigates a subgradient-based algorithm to solve the system identification problem for linear time-invariant systems with non-smooth objectives. This is essential for robust system identification in safety-critical applications. While existing work provides theoretical exact recovery guarantees using optimization solvers, the design of fast learning algorithms with convergence guarantees for practical use remains unexplored. We analyze the subgradient method in this setting, where the optimization problems to be solved evolve over time as new measurements are collected, and we establish linear convergence to the ground-truth system for both the best and Polyak step sizes after a burn-in period. We further characterize sublinear convergence of the iterates under constant and diminishing step sizes, which require only minimal information and thus offer broad applicability. Finally, we compare the time complexity of standard solvers with the subgradient algorithm and support our findings with experimental results. This is the first work to analyze subgradient algorithms for system identification with non-smooth objectives.
comment: 20 pages, 2 figures
♻ ☆ Incentivizing Truthful Language Models via Peer Elicitation Games
Large Language Models (LLMs) have demonstrated strong generative capabilities but remain prone to inconsistencies and hallucinations. We introduce Peer Elicitation Games (PEG), a training-free, game-theoretic framework for aligning LLMs through a peer elicitation mechanism involving a generator and multiple discriminators instantiated from distinct base models. Discriminators interact in a peer evaluation setting, where utilities are computed using a determinant-based mutual information score that provably incentivizes truthful reporting without requiring ground-truth labels. We establish theoretical guarantees showing that each agent, via online learning, achieves sublinear regret in the sense their cumulative performance approaches that of the best fixed truthful strategy in hindsight. Moreover, we prove last-iterate convergence to a truthful Nash equilibrium, ensuring that the actual policies used by agents converge to stable and truthful behavior over time. Empirical evaluations across multiple benchmarks demonstrate significant improvements in factual accuracy. These results position PEG as a practical approach for eliciting truthful behavior from LLMs without supervision or fine-tuning.
♻ ☆ Efficient Large Language Model Inference with Neural Block Linearization
The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs. The implementation is available at: https://github.com/LIONS-EPFL/NBL.
♻ ☆ RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM's immense action space and sparse reward. Critically, RLVR can lead to the capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components, i.e., Multiple Importance Sampling to address distributional mismatch from external data, and Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2\%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.
♻ ☆ Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48\% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://github.com/Skilteee/DiZO.
♻ ☆ Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach
Recommender systems are essential for content-sharing platforms by curating personalized content. To improve recommender systems, platforms frequently rely on creator-side randomized experiments to evaluate algorithm updates. We show that commonly adopted difference-in-means estimators can lead to severely biased estimates due to recommender interference, where treated and control creators compete for exposure. This bias can result in incorrect business decisions. To address this, we propose a ``recommender choice model'' that explicitly represents the interference pathway. The approach combines a structural choice framework with neural networks to account for rich viewer-content heterogeneity. Building on this foundation, we develop a debiased estimator using the double machine learning (DML) framework to adjust for errors from nuisance component estimation. We show that the estimator is $\sqrt{n}$-consistent and asymptotically normal, and we extend the DML theory to handle correlated data, which arise in our context due to overlapped items. We validate our method with a large-scale field experiment on Weixin short-video platform, using a costly double-sided randomization design to obtain an interference-free ground truth. Our results show that the proposed estimator successfully recovers this ground truth, whereas benchmark estimators exhibit substantial bias, and in some cases, yield reversed signs.
♻ ☆ mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that requires substantially lower computational cost and is highly robust. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative exceeds the original instantiation of vec2vec by orders of magnitude in efficiency, while matching or exceeding their results. The method's stability and interpretable algorithmic steps facilitate scaling and unlock new opportunities for adoption in new domains and fields.
♻ ☆ HERO: Heterogeneous Continual Graph Learning via Meta-Knowledge Distillation
Heterogeneous graph neural networks have seen rapid progress in web applications such as social networks, knowledge graphs, and recommendation systems, driven by the inherent heterogeneity of web data. However, existing methods typically assume static graphs, while real-world graphs are continuously evolving. This dynamic nature requires models to adapt to new data while preserving existing knowledge. To this end, this work introduces HERO (HEterogeneous continual gRaph learning via meta-knOwledge distillation), a unified framework for continual learning on heterogeneous graphs. HERO employs meta-adaptation, a gradient-based meta-learning strategy that provides directional guidance for rapid adaptation to new tasks with limited samples. To enable efficient and effective knowledge reuse, we propose DiSCo (Diversity Sampling with semantic Consistency), a heterogeneity-aware sampling method that maximizes target node diversity and expands subgraphs along metapaths, retaining critical semantic and structural information with minimal overhead. Furthermore, HERO incorporates heterogeneity-aware knowledge distillation, which aligns knowledge at both the node and semantic levels to balance adaptation and retention across tasks. Extensive experiments on four web-related heterogeneous graph benchmarks demonstrate that HERO substantially mitigates catastrophic forgetting while achieving efficient and consistent knowledge reuse in dynamic web environments.
comment: updated version with new LaTeX template and minor formatting revisions, added references and experimental results
♻ ☆ DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs are accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving the performance of the teacher model, student models distilled from the defensively generated outputs demonstrate catastrophically reduced performance, demonstrating DOGe as a practical safeguard against KD-based model imitation.
comment: Code is available at https://github.com/unites-lab/doge
♻ ☆ Statistical Decision Theory with Counterfactual Loss
Many researchers have applied classical statistical decision theory to evaluate treatment choices and learn optimal policies. However, because this framework is based solely on realized outcomes under chosen decisions and ignores counterfactual outcomes, it cannot assess the quality of a decision relative to feasible alternatives. For example, in bail decisions, a judge must consider not only crime prevention but also the avoidance of unnecessary burdens on arrestees. To address this limitation, we generalize standard decision theory by incorporating counterfactual losses, allowing decisions to be evaluated using all potential outcomes. The central challenge in this counterfactual statistical decision framework is identification: since only one potential outcome is observed for each unit, the associated counterfactual risk is generally not identifiable. We prove that, under the assumption of strong ignorability, the counterfactual risk is identifiable if and only if the counterfactual loss function is additive in the potential outcomes. Moreover, we demonstrate that additive counterfactual losses can yield treatment recommendations, which differ from those based on standard loss functions when the decision problem involves more than two treatment options. One interpretation of this result is that additive counterfactual losses can capture the accuracy and difficulty of a decision, whereas standard losses account for accuracy alone. Finally, we formulate a symbolic linear inverse program that, given a counterfactual loss, determines whether its risk is identifiable, without requiring data.
♻ ☆ HardNet: Hard-Constrained Neural Networks with Universal Approximation Guarantees
Incorporating prior knowledge or specifications of input-output relationships into machine learning models has attracted significant attention, as it enhances generalization from limited data and yields conforming outputs. However, most existing approaches use soft constraints by penalizing violations through regularization, which offers no guarantee of constraint satisfaction, especially on inputs far from the training distribution--an essential requirement in safety-critical applications. On the other hand, imposing hard constraints on neural networks may hinder their representational power, adversely affecting performance. To address this, we propose HardNet, a practical framework for constructing neural networks that inherently satisfy hard constraints without sacrificing model capacity. Unlike approaches that modify outputs only at inference time, HardNet enables end-to-end training with hard constraint guarantees, leading to improved performance. To the best of our knowledge, HardNet is the first method that enables efficient and differentiable enforcement of more than one input-dependent inequality constraint. It allows unconstrained optimization of the network parameters using standard algorithms by appending a differentiable closed-form enforcement layer to the network's output. Furthermore, we show that HardNet retains neural networks' universal approximation capabilities. We demonstrate its versatility and effectiveness across various applications: learning with piecewise constraints, learning optimization solvers with guaranteed feasibility, and optimizing control policies in safety-critical systems.
♻ ☆ $Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training NeurIPS 2025
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
comment: NeurIPS 2025
♻ ☆ UFT: Unifying Supervised and Reinforcement Fine-Tuning
Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
♻ ☆ The Nondecreasing Rank
In this article the notion of the nondecreasing (ND) rank of a matrix or tensor is introduced. A tensor has an ND rank of r if it can be represented as a sum of r outer products of vectors, with each vector satisfying a monotonicity constraint. It is shown that for certain poset orderings finding an ND factorization of rank $r$ is equivalent to finding a nonnegative rank-r factorization of a transformed tensor. However, not every tensor that is monotonic has a finite ND rank. Theory is developed describing the properties of the ND rank, including typical, maximum, and border ND ranks. Highlighted also are the special settings where a matrix or tensor has an ND rank of one or two. As a means of finding low ND rank approximations to a data tensor we introduce a variant of the hierarchical alternating least squares algorithm. Low ND rank factorizations are found and interpreted for two datasets concerning the weight of pigs and a mental health survey during the COVID-19 pandemic.
comment: 29 pages, 6 figures
♻ ☆ Efficient Data Selection for Training Genomic Perturbation Models
Genomic studies face a vast hypothesis space, while interventions such as gene perturbations remain costly and time-consuming. To accelerate such experiments, gene perturbation models predict the transcriptional outcome of interventions. Since constructing the training set is challenging, active learning is often employed in a "lab-in-the-loop" process. While this strategy makes training more targeted, it is substantially slower, as it fails to exploit the inherent parallelizability of Perturb-seq experiments. Here, we focus on graph neural network-based gene perturbation models and propose a subset selection method that, unlike active learning, selects the training perturbations in one shot. Our method chooses the interventions that maximize the propagation of the supervision signal to the model. The selection criterion is defined over the input knowledge graph and is optimized with submodular maximization, ensuring a near-optimal guarantee. Experimental results across multiple datasets show that, in addition to providing months of acceleration compared to active learning, the method improves the stability of perturbation choices while maintaining competitive predictive accuracy.
comment: 17 pages
♻ ☆ Towards Principled Unsupervised Multi-Agent Reinforcement Learning
In reinforcement learning, we typically refer to unsupervised pre-training when we aim to pre-train a policy without a priori access to the task specification, i.e. rewards, to be later employed for efficient learning of downstream tasks. In single-agent settings, the problem has been extensively studied and mostly understood. A popular approach, called task-agnostic exploration, casts the unsupervised objective as maximizing the entropy of the state distribution induced by the agent's policy, from which principles and methods follow. In contrast, little is known about it in multi-agent settings, which are ubiquitous in the real world. What are the pros and cons of alternative problem formulations in this setting? How hard is the problem in theory, how can we solve it in practice? In this paper, we address these questions by first characterizing those alternative formulations and highlighting how the problem, even when tractable in theory, is non-trivial in practice. Then, we present a scalable, decentralized, trust-region policy search algorithm to address the problem in practical settings. Finally, we provide numerical validations to both corroborate the theoretical findings and pave the way for unsupervised multi-agent reinforcement learning via task-agnostic exploration in challenging domains, showing that optimizing for a specific objective, namely mixture entropy, provides an excellent trade-off between tractability and performances.
♻ ☆ Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels
We study optimal resource allocation in restless multi-armed bandits (RMABs) under unknown and non-stationary dynamics. Solving RMABs optimally is PSPACE-hard even with full knowledge of model parameters, and while the Whittle index policy offers asymptotic optimality with low computational cost, it requires access to stationary transition kernels - an unrealistic assumption in many applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Our algorithm achieves a dynamic regret of $\tilde O(T^{2/3}\tilde V^{1/3}+T^{4/5})$ for large RMABs, where $T$ is the number of episodes and $\tilde V$ is the total variation distance between consecutive transition kernels. Importantly, we handle the challenging case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. Window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed via an upper-confidence-bound of the estimated transition kernels and a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.
♻ ☆ Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech
This paper makes three contributions. First, via a substantial corpus of 1,419,047 comments posted on 3,161 YouTube news videos of major US cable news outlets, we analyze how users engage with LGBTQ+ news content. Our analyses focus both on positive and negative content. In particular, we construct a fine-grained hope speech classifier that detects positive (hope speech), negative, neutral, and irrelevant content. Second, in consultation with a public health expert specializing on LGBTQ+ health, we conduct an annotation study with a balanced and diverse political representation and release a dataset of 3,750 instances with fine-grained labels and detailed annotator demographic information. Finally, beyond providing a vital resource for the LGBTQ+ community, our annotation study and subsequent in-the-wild assessments reveal (1) strong association between rater political beliefs and how they rate content relevant to a marginalized community; (2) models trained on individual political beliefs exhibit considerable in-the-wild disagreement; and (3) zero-shot large language models (LLMs) align more with liberal raters.
♻ ☆ Improving Rectified Flow with Boundary Conditions ICCV 2025
Rectified Flow offers a simple and effective approach to high-quality generative modeling by learning a velocity field. However, we identify a limitation in directly modeling the velocity with an unconstrained neural network: the learned velocity often fails to satisfy certain boundary conditions, leading to inaccurate velocity field estimations that deviate from the desired ODE. This issue is particularly critical during stochastic sampling at inference, as the score function's errors are amplified near the boundary. To mitigate this, we propose a Boundary-enforced Rectified Flow Model (Boundary RF Model), in which we enforce boundary conditions with a minimal code modification. Boundary RF Model improves performance over vanilla RF model, demonstrating 8.01% improvement in FID score on ImageNet using ODE sampling and 8.98% improvement using SDE sampling.
comment: ICCV 2025
♻ ☆ Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey and Benchmark
Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks, showcasing exceptional performance. However, as these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands. To address these challenges, parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters. This paper presents a comprehensive survey of the latest advancements in the visual PEFT field, systematically reviewing current methodologies and categorizing them into four primary categories: addition-based, partial-based, unified-based, and multi-task tuning. In addition, this paper offers an in-depth analysis of widely used visual datasets and real-world applications where PEFT methods have been successfully applied. Furthermore, this paper introduces the V-PEFT Bench, a unified benchmark designed to standardize the evaluation of PEFT methods across a diverse set of vision tasks, ensuring consistency and fairness in comparison. Finally, the paper outlines potential directions for future research to propel advances in the PEFT field. A comprehensive collection of resources is available at https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning.
comment: Submitted to IEEE TPAMI
♻ ☆ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model
This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
♻ ☆ CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications
The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Safety-Guard-Dataset-v3, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-8B-v3 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. Furthermore, we show our moderately multilingual fine-tuning enables robust cross-lingual transfer and strong zero-shot generalization to unseen languages. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work advances multilingual LLM safety by enabling the development of culturally aware safety guard models.
♻ ☆ A Principled Path to Fitted Distributional Evaluation
In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted Q-evaluation -- developed for expectation-based reinforcement learning -- to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). While only a few related approaches exist, there remains no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.
♻ ☆ A Prospect-Theoretic Policy Gradient Framework for Behaviorally Nuanced Reinforcement Learning
Classical reinforcement learning (RL) typically assumes rational decision-making based on expected utility theory. However, this model has been shown to be empirically inconsistent with actual human preferences, as evidenced in psychology and behavioral economics. Cumulative Prospect Theory (CPT) provides a more nuanced model for human-based decision-making, capturing diverse attitudes and perceptions toward risk, gains, and losses. While prior work has integrated CPT with RL to solve CPT policy optimization problems, the understanding and impact of this formulation remain limited. Our contributions are as follows: (a) we derive a novel policy gradient theorem for CPT objectives, generalizing the foundational result in standard RL, (b) we design a model-free policy gradient algorithm for solving the CPT-RL problem, (c) we analyze our policy gradient estimator and prove asymptotic convergence of the algorithm to first-order stationary points, and (d) test its performance through simulations. Notably, our first-order policy gradient algorithm scales better than existing zeroth-order methods to larger state spaces. Our theoretical framework offers more flexibility to advance the integration of behavioral decision-making into RL.
♻ ☆ HypER: Hyperbolic Echo State Networks for Capturing Stretch-and-Fold Dynamics in Chaotic Flows
Forecasting chaotic dynamics beyond a few Lyapunov times is difficult because infinitesimal errors grow exponentially. Existing Echo State Networks (ESNs) mitigate this growth but employ reservoirs whose Euclidean geometry is mismatched to the stretch-and-fold structure of chaos. We introduce the Hyperbolic Embedding Reservoir (HypER), an ESN whose neurons are sampled in the Poincare ball and whose connections decay exponentially with hyperbolic distance. This negative-curvature construction embeds an exponential metric directly into the latent space, aligning the reservoir's local expansion-contraction spectrum with the system's Lyapunov directions while preserving standard ESN features such as sparsity, leaky integration, and spectral-radius control. Training is limited to a Tikhonov-regularized readout. On the chaotic Lorenz-63 and Roessler systems, and the hyperchaotic Chen-Ueta attractor, HypER consistently lengthens the mean valid-prediction horizon beyond Euclidean and graph-structured ESN baselines, with statistically significant gains confirmed over 30 independent runs; parallel results on real-world benchmarks, including heart-rate variability from the Santa Fe and MIT-BIH datasets and international sunspot numbers, corroborate its advantage. We further establish a lower bound on the rate of state divergence for HypER, mirroring Lyapunov growth.
comment: 8 pages, accepted in ECAI 2025
♻ ☆ Large Language Models are Powerful Electronic Health Record Encoders
Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity present significant challenges for traditional machine learning methods. Recently, domain-specific EHR foundation models trained on large volumes of unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited access to diverse, high-quality datasets, and inconsistencies in coding standards and clinical practices. In this study, we explore the use of general-purpose Large Language Models (LLMs) to encode EHR into high-dimensional representations for downstream clinical prediction tasks. We convert structured EHR data into Markdown-formatted plain-text documents by replacing medical codes with natural language descriptions. This enables the use of LLMs and their extensive semantic understanding and generalization capabilities as effective encoders of EHRs without requiring access to private medical training data. We show that LLM-based embeddings can often match or even surpass the performance of a specialized EHR foundation model, CLMBR-T-Base, across 15 diverse clinical tasks from the EHRSHOT benchmark. Critically, our approach requires no institution-specific training and can incorporate any medical code with a text description, whereas existing EHR foundation models operate on fixed vocabularies and can only process codes seen during pretraining. To demonstrate generalizability, we further evaluate the approach on the UK Biobank (UKB) cohort, out-of-domain for CLMBR-T-Base, whose fixed vocabulary covers only 16% of UKB codes. Notably, an LLM-based model achieves superior performance for prediction of disease onset, hospitalization, and mortality, indicating robustness to population and coding shifts.
♻ ☆ VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play NeurIPS 2025
Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy RL methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves 69.5% win rate against the strongest baseline in the 3 vs 3 task, demonstrating its potential for tackling the complex interplay between low-level control and high-level strategy. To highlight VolleyBots' sim-to-real potential, we further demonstrate the zero-shot deployment of a policy trained entirely in simulation on real-world drones.
comment: Accepted by NeurIPS 2025
♻ ☆ Observation-guided Interpolation Using Graph Neural Networks for High-Resolution Nowcasting in Switzerland
Recent advances in neural weather forecasting have shown significant potential for accurate short-term forecasts. However, adapting such gridded approaches to smaller, topographically complex regions like Switzerland introduces computational challenges, especially when aiming for high spatial (1km) and temporal (10 min) resolution. This paper presents a Graph Neural Network (GNN)-based approach for high-resolution nowcasting in Switzerland using the Anemoi framework and observational inputs. The proposed architecture combines surface observations with selected past and future numerical weather prediction (NWP) states, enabling an observation-guided interpolation strategy that enhances short-term accuracy while preserving physical consistency. We evaluate two models, one trained using local nowcasting analyses and one trained without, on multiple surface variables and compare it against operational high-resolution NWP (ICON-CH1) and nowcasting (INCA) baselines. Results over the test period show that both GNNs consistently outperform ICON-CH1 when verified against INCA analyses across most variables and lead times. Relative to the INCA forecast system, scores against INCA analyses show AI gains beyond 2h (with early-lead disadvantages attributable to INCA's warm start from the analysis), while verification against held-out stations shows no systematic degradation at short lead-times for AI models and frequent outperformance across surface variables. A comprehensive verification procedure, including spatial skill scores for precipitation, pairwise significance testing and event-based evaluation, demonstrates the operational relevance of the approach for mountainous domains. These results indicate that high-resolution, observation-guided GNNs can match or exceed the skill of established forecasting systems for short lead times, including when they are trained without nowcasting analyses.
comment: Updated 19.10.2025
♻ ☆ Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics
Transformers are the de-facto choice for sequence modelling, yet their quadratic self-attention and weak temporal bias can make long-range forecasting both expensive and brittle. We introduce FreezeTST, a lightweight hybrid that interleaves frozen random-feature (reservoir) blocks with standard trainable Transformer layers. The frozen blocks endow the network with rich nonlinear memory at no optimisation cost; the trainable layers learn to query this memory through self-attention. The design cuts trainable parameters and also lowers wall-clock training time, while leaving inference complexity unchanged. On seven standard long-term forecasting benchmarks, FreezeTST consistently matches or surpasses specialised variants such as Informer, Autoformer, and PatchTST; with substantially lower compute. Our results show that embedding reservoir principles within Transformers offers a simple, principled route to efficient long-term time-series prediction.
comment: 8 pages, 5 tables, 3 figures, accepted at ECAI 2025
♻ ☆ Seeing in the Dark: A Teacher-Student Framework for Dark Video Action Recognition via Knowledge Distillation and Contrastive Learning
Action recognition in dark or low-light (under-exposed) videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes ActLumos, a teacher-student framework that attains single-stream inference while retaining multi-stream level accuracy. The teacher consumes dual stream inputs, which include original dark frames and retinex-enhanced frames, processed by weight-shared R(2+1)D-34 backbones and fused by a Dynamic Feature Fusion (DFF) module, which dynamically re-weights the two streams at each time step, emphasising the most informative temporal segments. The teacher is also included with a supervised contrastive loss (SupCon) that sharpens class margins. The student shares the R(2+1)D-34 backbone but uses only dark frames and no fusion at test time. The student is first pre-trained with self-supervision on dark clips of both datasets without their labels and then fine-tuned with knowledge distillation from the teacher, transferring the teacher's multi-stream knowledge into a single-stream model. Under single-stream inference, the distilled student attains state-of-the-art accuracy of 96.92% (Top-1) on ARID V1.0, 88.27% on ARID V1.5, and 48.96% on Dark48. Ablation studies further highlight the individual contributions of each component, i.e., DFF in the teacher outperforms single or static fusion, knowledge distillation (KD) transfers these gains to the single-stream student, and two-view spatio-temporal SSL surpasses spatial-only or temporal-only variants without increasing inference cost. The official website of this work is available at: https://github.com/HrishavBakulBarua/ActLumos
♻ ☆ Boosting Graph Robustness Against Backdoor Attacks: An Over-Similarity Perspective
Graph Neural Networks (GNNs) have achieved notable success in tasks such as social and transportation networks. However, recent studies have highlighted the vulnerability of GNNs to backdoor attacks, raising significant concerns about their reliability in real-world applications. Despite initial efforts to defend against specific graph backdoor attacks, existing defense methods face two main challenges: either the inability to establish a clear distinction between triggers and clean nodes, resulting in the removal of many clean nodes, or the failure to eliminate the impact of triggers, making it challenging to restore the target nodes to their pre-attack state. Through empirical analysis of various existing graph backdoor attacks, we observe that the triggers generated by these methods exhibit over-similarity in both features and structure. Based on this observation, we propose a novel graph backdoor defense method SimGuard. We first utilizes a similarity-based metric to detect triggers and then employs contrastive learning to train a backdoor detector that generates embeddings capable of separating triggers from clean nodes, thereby improving detection efficiency. Extensive experiments conducted on real-world datasets demonstrate that our proposed method effectively defends against various graph backdoor attacks while preserving performance on clean nodes. The code will be released upon acceptance.
comment: After discussions with one of the co-authors, it was decided that this version should not be made public at this time. To respect the co-author's perspective and ensure alignment among all authors, I am requesting the withdrawal of this article
Computer Vision and Pattern Recognition 52
☆ HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications
Hyperspectral images (HSI) promise to support a range of new applications in computer vision. Recent research has explored the feasibility of generalizable Spectral Reconstruction (SR), the problem of recovering a HSI from a natural three-channel color image in unseen scenarios. However, previous Multi-Scale Attention (MSA) works have only demonstrated sufficient generalizable results for very sparse spectra, while modern HSI sensors contain hundreds of channels. This paper introduces a novel approach to spectral reconstruction via our HYbrid knowledge Distillation and spectral Reconstruction Architecture (HYDRA). Using a Teacher model that encapsulates latent hyperspectral image data and a Student model that learns mappings from natural images to the Teacher's encoded domain, alongside a novel training method, we achieve high-quality spectral reconstruction. This addresses key limitations of prior SR models, providing SOTA performance across all metrics, including an 18\% boost in accuracy, and faster inference times than current SOTA models at various channel depths.
☆ Universal and Transferable Attacks on Pathology Foundation Models
We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse field-of-views independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models - never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern. The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.
comment: 38 Pages, 8 Figures
☆ Structured Interfaces for Automated Reasoning with 3D Scene Graphs
In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content. A video supplement is available at https://www.youtube.com/watch?v=zY_YI9giZSA.
comment: 25 pages, 3 figures
☆ MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g, MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues - each averaging four turns - derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse is a landscape of evaluating multi-turn interaction abilities for VLMs.
comment: Project website: https://passing2961.github.io/multiverse-project-page/
☆ Self-Supervised Learning to Fly using Efficient Semantic Segmentation and Metric Depth Estimation for Low-Cost Autonomous UAVs
This paper presents a vision-only autonomous flight system for small UAVs operating in controlled indoor environments. The system combines semantic segmentation with monocular depth estimation to enable obstacle avoidance, scene exploration, and autonomous safe landing operations without requiring GPS or expensive sensors such as LiDAR. A key innovation is an adaptive scale factor algorithm that converts non-metric monocular depth predictions into accurate metric distance measurements by leveraging semantic ground plane detection and camera intrinsic parameters, achieving a mean distance error of 14.4 cm. The approach uses a knowledge distillation framework where a color-based Support Vector Machine (SVM) teacher generates training data for a lightweight U-Net student network (1.6M parameters) capable of real-time semantic segmentation. For more complex environments, the SVM teacher can be replaced with a state-of-the-art segmentation model. Testing was conducted in a controlled 5x4 meter laboratory environment with eight cardboard obstacles simulating urban structures. Extensive validation across 30 flight tests in a real-world environment and 100 flight tests in a digital-twin environment demonstrates that the combined segmentation and depth approach increases the distance traveled during surveillance and reduces mission time while maintaining 100% success rates. The system is further optimized through end-to-end learning, where a compact student neural network learns complete flight policies from demonstration data generated by our best-performing method, achieving an 87.5% autonomous mission success rate. This work advances practical vision-based drone navigation in structured environments, demonstrating solutions for metric depth estimation and computational efficiency challenges that enable deployment on resource-constrained platforms.
☆ A Deep Learning Framework for Real-Time Image Processing in Medical Diagnostics: Enhancing Accuracy and Speed in Clinical Applications
Medical imaging plays a vital role in modern diagnostics; however, interpreting high-resolution radiological data remains time-consuming and susceptible to variability among clinicians. Traditional image processing techniques often lack the precision, robustness, and speed required for real-time clinical use. To overcome these limitations, this paper introduces a deep learning framework for real-time medical image analysis designed to enhance diagnostic accuracy and computational efficiency across multiple imaging modalities, including X-ray, CT, and MRI. The proposed system integrates advanced neural network architectures such as U-Net, EfficientNet, and Transformer-based models with real-time optimization strategies including model pruning, quantization, and GPU acceleration. The framework enables flexible deployment on edge devices, local servers, and cloud infrastructures, ensuring seamless interoperability with clinical systems such as PACS and EHR. Experimental evaluations on public benchmark datasets demonstrate state-of-the-art performance, achieving classification accuracies above 92%, segmentation Dice scores exceeding 91%, and inference times below 80 milliseconds. Furthermore, visual explanation tools such as Grad-CAM and segmentation overlays enhance transparency and clinical interpretability. These results indicate that the proposed framework can substantially accelerate diagnostic workflows, reduce clinician workload, and support trustworthy AI integration in time-critical healthcare environments.
comment: 20 pages, 4 figures
☆ VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we reformulate token compression as a lightweight plug-and-play framework that reformulates token compression into an end-to-end learnable decision process. To be specific, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection various arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector demonstrates generalization across various compression rates and adaptively identifying critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with 30% retention budget, outperforming prior methods by 12.14% at 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector .
comment: 22 pages, 8 figures
☆ SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.
☆ Patronus: Safeguarding Text-to-Image Models against White-Box Adversaries
Text-to-image (T2I) models, though exhibiting remarkable creativity in image generation, can be exploited to produce unsafe images. Existing safety measures, e.g., content moderation or model alignment, fail in the presence of white-box adversaries who know and can adjust model parameters, e.g., by fine-tuning. This paper presents a novel defensive framework, named Patronus, which equips T2I models with holistic protection to defend against white-box adversaries. Specifically, we design an internal moderator that decodes unsafe input features into zero vectors while ensuring the decoding performance of benign input features. Furthermore, we strengthen the model alignment with a carefully designed non-fine-tunable learning mechanism, ensuring the T2I model will not be compromised by malicious fine-tuning. We conduct extensive experiments to validate the intactness of the performance on safe content generation and the effectiveness of rejecting unsafe content generation. Results also confirm the resilience of Patronus against various fine-tuning attacks by white-box adversaries.
comment: 14 pages, 18 figures, 7 tables
☆ Fit for Purpose? Deepfake Detection in the Real World
The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. Results urge the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.
☆ Watch Where You Move: Region-aware Dynamic Aggregation and Excitation for Gait Recognition
Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.
☆ Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions NeurIPS 2025
Despite recent advances, vision-language models trained with standard contrastive objectives still struggle with compositional reasoning -- the ability to understand structured relationships between visual and linguistic elements. This shortcoming is largely due to the tendency of the text encoder to focus on individual words rather than their relations, a limitation reinforced by contrastive training that primarily aligns words with visual objects. In this paper, we introduce REconstruction and Alignment of text Descriptions (READ), a fine-tuning method designed to enhance compositional reasoning by adding two auxiliary objectives to the contrastive learning: (1) a token-level reconstruction objective, where a frozen pre-trained decoder reconstructs alternative captions based on the embedding of the original caption; and (2) a sentence-level alignment objective, which explicitly aligns paraphrased sentences in the embedding space. We show that READ-CLIP, a model derived by applying the READ method to the pre-trained CLIP model, achieves the state-of-the-art performance across five major compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%. Furthermore, applying the READ to existing CLIP variants (including NegCLIP and FSC-CLIP) also improves performance on these benchmarks. Quantitative and qualitative analyses reveal that our proposed objectives -- reconstruction and alignment -- offer complementary benefits: the former encourages the encoder to capture relationships between words within a caption, while the latter ensures consistent representations for paraphrases expressed with different wording.
comment: Accepted at NeurIPS 2025 (poster). This is the camera-ready version
☆ Image Categorization and Search via a GAT Autoencoder and Representative Models
We propose a method for image categorization and retrieval that leverages graphs and a graph attention network (GAT)-based autoencoder. Our approach is representative-centric, that is, we execute the categorization and retrieval process via the representative models we construct for the images and image categories. We utilize a graph where nodes represent images (or their representatives) and edges capture similarity relationships. GAT highlights important features and relationships between images, enabling the autoencoder to construct context-aware latent representations that capture the key features of each image relative to its neighbors. We obtain category representatives from these embeddings and categorize a query image by comparing its representative to the category representatives. We then retrieve the most similar image to the query image within its identified category. We demonstrate the effectiveness of our representative-centric approach through experiments with both the GAT autoencoders and standard feature-based techniques.
comment: 10 pages, 22 figures, Under review
☆ Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch
Quantization of neural networks provides benefits of inference in less compute and memory requirements. Previous work in quantization lack two important aspects which this work provides. First almost all previous work in quantization used a non-differentiable approach and for learning; the derivative is usually set manually in backpropogation which make the learning ability of algorithm questionable, our approach is not just differentiable, we also provide proof of convergence of our approach to the optimal neural network. Second previous work in shift/logrithmic quantization either have avoided activation quantization along with weight quantization or achieved less accuracy. Learning logrithmic quantize values of form $2^n$ requires the quantization function can scale to more than 1 bit quantization which is another benifit of our quantization that it provides $n$ bits quantization as well. Our approach when tested with image classification task using imagenet dataset, resnet18 and weight quantization only achieves less than 1 percent accuracy compared to full precision accuracy while taking only 15 epochs to train using shift bit quantization and achieves comparable to SOTA approaches accuracy in both weight and activation quantization using shift bit quantization in 15 training epochs with slightly higher(only higher cpu instructions) inference cost compared to 1 bit quantization(without logrithmic quantization) and not requiring any higher precision multiplication.
☆ OOS-DSD: Improving Out-of-stock Detection in Retail Images using Auxiliary Tasks
Out-of-stock (OOS) detection is a very important retail verification process that aims to infer the unavailability of products in their designated areas on the shelf. In this paper, we introduce OOS-DSD, a novel deep learning-based method that advances OOS detection through auxiliary learning. In particular, we extend a well-established YOLOv8 object detection architecture with additional convolutional branches to simultaneously detect OOS, segment products, and estimate scene depth. While OOS detection and product segmentation branches are trained using ground truth data, the depth estimation branch is trained using pseudo-labeled annotations produced by the state-of-the-art (SOTA) depth estimation model Depth Anything V2. Furthermore, since the aforementioned pseudo-labeled depth estimates display relative depth, we propose an appropriate depth normalization procedure that stabilizes the training process. The experimental results show that the proposed method surpassed the performance of the SOTA OOS detection methods by 1.8% of the mean average precision (mAP). In addition, ablation studies confirm the effectiveness of auxiliary learning and the proposed depth normalization procedure, with the former increasing mAP by 3.7% and the latter by 4.2%.
☆ PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
☆ HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.
comment: ACM International Conference on Multimedia 2025
☆ NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation ICCV 2025
In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn the general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
comment: ICCV 2025
☆ Instance-Aware Pseudo-Labeling and Class-Focused Contrastive Learning for Weakly Supervised Domain Adaptive Segmentation of Electron Microscopy
Annotation-efficient segmentation of the numerous mitochondria instances from various electron microscopy (EM) images is highly valuable for biological and neuroscience research. Although unsupervised domain adaptation (UDA) methods can help mitigate domain shifts and reduce the high costs of annotating each domain, they typically have relatively low performance in practical applications. Thus, we investigate weakly supervised domain adaptation (WDA) that utilizes additional sparse point labels on the target domain, which require minimal annotation effort and minimal expert knowledge. To take full use of the incomplete and imprecise point annotations, we introduce a multitask learning framework that jointly conducts segmentation and center detection with a novel cross-teaching mechanism and class-focused cross-domain contrastive learning. While leveraging unlabeled image regions is essential, we introduce segmentation self-training with a novel instance-aware pseudo-label (IPL) selection strategy. Unlike existing methods that typically rely on pixel-wise pseudo-label filtering, the IPL semantically selects reliable and diverse pseudo-labels with the help of the detection task. Comprehensive validations and comparisons on challenging datasets demonstrate that our method outperforms existing UDA and WDA methods, significantly narrowing the performance gap with the supervised upper bound. Furthermore, under the UDA setting, our method also achieves substantial improvements over other UDA techniques.
☆ VIPAMIN: Visual Prompt Initialization via Embedding Selection and Subspace Expansion NeurIPS 2025
In the era of large-scale foundation models, fully fine-tuning pretrained networks for each downstream task is often prohibitively resource-intensive. Prompt tuning offers a lightweight alternative by introducing tunable prompts while keeping the backbone frozen. However, existing visual prompt tuning methods often fail to specialize the prompts or enrich the representation space--especially when applied to self-supervised backbones. We show that these limitations become especially pronounced in challenging tasks and data-scarce settings, where effective adaptation is most critical. In this work, we introduce VIPAMIN, a visual prompt initialization strategy that enhances adaptation of self-supervised models by (1) aligning prompts with semantically informative regions in the embedding space, and (2) injecting novel representational directions beyond the pretrained subspace. Despite its simplicity--requiring only a single forward pass and lightweight operations--VIPAMIN consistently improves performance across diverse tasks and dataset sizes, setting a new state of the art in visual prompt tuning. Our code is available at https://github.com/iamjaekyun/vipamin.
comment: NeurIPS 2025
☆ Enhancing Rotated Object Detection via Anisotropic Gaussian Bounding Box and Bhattacharyya Distance
Detecting rotated objects accurately and efficiently is a significant challenge in computer vision, particularly in applications such as aerial imagery, remote sensing, and autonomous driving. Although traditional object detection frameworks are effective for axis-aligned objects, they often underperform in scenarios involving rotated objects due to their limitations in capturing orientation variations. This paper introduces an improved loss function aimed at enhancing detection accuracy and robustness by leveraging the Gaussian bounding box representation and Bhattacharyya distance. In addition, we advocate for the use of an anisotropic Gaussian representation to address the issues associated with isotropic variance in square-like objects. Our proposed method addresses these challenges by incorporating a rotation-invariant loss function that effectively captures the geometric properties of rotated objects. We integrate this proposed loss function into state-of-the-art deep learning-based rotated object detection detectors, and extensive experiments demonstrated significant improvements in mean Average Precision metrics compared to existing methods. The results highlight the potential of our approach to establish new benchmark in rotated object detection, with implications for a wide range of applications requiring precise and reliable object localization irrespective of orientation.
comment: Neurocomputing
☆ RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba ECCV 2024
Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.
comment: Extended version of ECCV 2024 paper arXiv:2407.01872. The dataset and code are released at https://github.com/KPeng9510/refAVA2
☆ EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal, a large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.
LightGlueStick: a Fast and Robust Glue for Joint Point-Line Matching ICCV
Lines and points are complementary local features, whose combination has proven effective for applications such as SLAM and Structure-from-Motion. The backbone of these pipelines are the local feature matchers, establishing correspondences across images. Traditionally, point and line matching have been treated as independent tasks. Recently, GlueStick proposed a GNN-based network that simultaneously operates on points and lines to establish matches. While running a single joint matching reduced the overall computational complexity, the heavy architecture prevented real-time applications or deployment to edge devices. Inspired by recent progress in point matching, we propose LightGlueStick, a lightweight matcher for points and line segments. The key novel component in our architecture is the Attentional Line Message Passing (ALMP), which explicitly exposes the connectivity of the lines to the network, allowing for efficient communication between nodes. In thorough experiments we show that LightGlueStick establishes a new state-of-the-art across different benchmarks. The code is available at https://github.com/aubingazhib/LightGlueStick.
comment: Accepted at ICCVW 2025
☆ SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives-such as predicting image rotation or reconstructing masked patches-into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors-such as task difficulty, model scale, and semantic alignment with the target domain-that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
☆ REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.
☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.
comment: Accepted to AICCC 2025
☆ Demeter: A Parametric Model of Crop Plant Morphology from the Real World ICCV 2025
Learning 3D parametric shape models of objects has gained popularity in vision and graphics and has showed broad utility in 3D reconstruction, generation, understanding, and simulation. While powerful models exist for humans and animals, equally expressive approaches for modeling plants are lacking. In this work, we present Demeter, a data-driven parametric model that encodes key factors of a plant morphology, including topology, shape, articulation, and deformation into a compact learned representation. Unlike previous parametric models, Demeter handles varying shape topology across various species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To advance crop plant modeling, we collected a large-scale, ground-truthed dataset from a soybean farm as a testbed. Experiments show that Demeter effectively synthesizes shapes, reconstructs structures, and simulates biophysical processes. Code and data is available at https://tianhang-cheng.github.io/Demeter/.
comment: ICCV 2025
☆ iWatchRoadv2: Pothole Detection, Geospatial Mapping, and Intelligent Road Governance
Road potholes pose significant safety hazards and maintenance challenges, particularly on India's diverse and under-maintained road networks. This paper presents iWatchRoadv2, a fully automated end-to-end platform for real-time pothole detection, GPS-based geotagging, and dynamic road health visualization using OpenStreetMap (OSM). We curated a self-annotated dataset of over 7,000 dashcam frames capturing diverse Indian road conditions, weather patterns, and lighting scenarios, which we used to fine-tune the Ultralytics YOLO model for accurate pothole detection. The system synchronizes OCR-extracted video timestamps with external GPS logs to precisely geolocate each detected pothole, enriching detections with comprehensive metadata, including road segment attribution and contractor information managed through an optimized backend database. iWatchRoadv2 introduces intelligent governance features that enable authorities to link road segments with contract metadata through a secure login interface. The system automatically sends alerts to contractors and officials when road health deteriorates, supporting automated accountability and warranty enforcement. The intuitive web interface delivers actionable analytics to stakeholders and the public, facilitating evidence-driven repair planning, budget allocation, and quality assessment. Our cost-effective and scalable solution streamlines frame processing and storage while supporting seamless public engagement for urban and rural deployments. By automating the complete pothole monitoring lifecycle, from detection to repair verification, iWatchRoadv2 enables data-driven smart city management, transparent governance, and sustainable improvements in road infrastructure maintenance. The platform and live demonstration are accessible at https://smlab.niser.ac.in/project/iwatchroad.
comment: Under review
☆ Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
The development of computer-assisted surgery systems depends on large-scale, annotated datasets. Current resources for cataract surgery often lack the diversity and annotation depth needed to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos from two surgical centers, performed by surgeons with a range of experience levels. This resource is enriched with four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on the established competency rubrics like the ICO-OSCAR. The technical quality of the dataset is supported by a series of benchmarking experiments for key surgical AI tasks, including workflow recognition, scene segmentation, and automated skill assessment. Furthermore, we establish a domain adaptation baseline for the phase recognition task by training a model on a subset of surgical centers and evaluating its performance on a held-out center. The dataset and annotations are available in Google Form (https://docs.google.com/forms/d/e/1FAIpQLSfmyMAPSTGrIy2sTnz0-TMw08ZagTimRulbAQcWdaPwDy187A/viewform?usp=dialog).
comment: 20 pages, 11 figures, 11 tables. Data descriptor for the Cataract-LMM benchmark dataset. Source code and dataset are available
☆ MIRAD - A comprehensive real-world robust anomaly detection dataset for Mass Individualization
Social manufacturing leverages community collaboration and scattered resources to realize mass individualization in modern industry. However, this paradigm shift also introduces substantial challenges in quality control, particularly in defect detection. The main difficulties stem from three aspects. First, products often have highly customized configurations. Second, production typically involves fragmented, small-batch orders. Third, imaging environments vary considerably across distributed sites. To overcome the scarcity of real-world datasets and tailored algorithms, we introduce the Mass Individualization Robust Anomaly Detection (MIRAD) dataset. As the first benchmark explicitly designed for anomaly detection in social manufacturing, MIRAD captures three critical dimensions of this domain: (1) diverse individualized products with large intra-class variation, (2) data collected from six geographically dispersed manufacturing nodes, and (3) substantial imaging heterogeneity, including variations in lighting, background, and motion conditions. We then conduct extensive evaluations of state-of-the-art (SOTA) anomaly detection methods on MIRAD, covering one-class, multi-class, and zero-shot approaches. Results show a significant performance drop across all models compared with conventional benchmarks, highlighting the unresolved complexities of defect detection in real-world individualized production. By bridging industrial requirements and academic research, MIRAD provides a realistic foundation for developing robust quality control solutions essential for Industry 5.0. The dataset is publicly available at https://github.com/wu33learn/MIRAD.
comment: https://github.com/wu33learn/MIRAD
☆ Beyond Fixed Anchors: Precisely Erasing Concepts with Sibling Exclusive Counterparts
Existing concept erasure methods for text-to-image diffusion models commonly rely on fixed anchor strategies, which often lead to critical issues such as concept re-emergence and erosion. To address this, we conduct causal tracing to reveal the inherent sensitivity of erasure to anchor selection and define Sibling Exclusive Concepts as a superior class of anchors. Based on this insight, we propose \textbf{SELECT} (Sibling-Exclusive Evaluation for Contextual Targeting), a dynamic anchor selection framework designed to overcome the limitations of fixed anchors. Our framework introduces a novel two-stage evaluation mechanism that automatically discovers optimal anchors for precise erasure while identifying critical boundary anchors to preserve related concepts. Extensive evaluations demonstrate that SELECT, as a universal anchor solution, not only efficiently adapts to multiple erasure frameworks but also consistently outperforms existing baselines across key performance metrics, averaging only 4 seconds for anchor mining of a single concept.
☆ On the Provable Importance of Gradients for Language-Assisted Image Clustering ICCV2025
This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed as GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves the state-of-the-art clustering performance on various benchmarks.
comment: revised and extended version of ICCV2025
☆ RL makes MLLMs see better than SFT
A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight-namely, the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/
☆ TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement
Autoregressive Model (AR) has shown remarkable success in conditional image generation. However, these approaches for multiple reference generation struggle with decoupling different reference identities. In this work, we propose the TokenAR framework, specifically focused on a simple but effective token-level enhancement mechanism to address reference identity confusion problem. Such token-level enhancement consists of three parts, 1). Token Index Embedding clusters the tokens index for better representing the same reference images; 2). Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens; 3). The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.This token-enhancement framework significantly augments the capabilities of existing AR based methods in conditional image generation, enabling good identity consistency while preserving high quality background reconstruction. Driven by the goal of high-quality and high-diversity in multi-subject generation, we introduce the InstructAR Dataset, the first open-source, large-scale, multi-reference input, open domain image generation dataset that includes 28K training pairs, each example has two reference subjects, a relative prompt and a background with mask annotation, curated for multiple reference image generation training and evaluating. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in multiple reference image generation task. The implementation code and datasets will be made publicly. Codes are available, see https://github.com/lyrig/TokenAR
DiffusionX: Efficient Edge-Cloud Collaborative Image Generation with Multi-Round Prompt Evolution
Recent advances in diffusion models have driven remarkable progress in image generation. However, the generation process remains computationally intensive, and users often need to iteratively refine prompts to achieve the desired results, further increasing latency and placing a heavy burden on cloud resources. To address this challenge, we propose DiffusionX, a cloud-edge collaborative framework for efficient multi-round, prompt-based generation. In this system, a lightweight on-device diffusion model interacts with users by rapidly producing preview images, while a high-capacity cloud model performs final refinements after the prompt is finalized. We further introduce a noise level predictor that dynamically balances the computation load, optimizing the trade-off between latency and cloud workload. Experiments show that DiffusionX reduces average generation time by 15.8% compared with Stable Diffusion v1.5, while maintaining comparable image quality. Moreover, it is only 0.9% slower than Tiny-SD with significantly improved image quality, thereby demonstrating efficiency and scalability with minimal overhead.
♻ ☆ Dual Caption Preference Optimization for Diffusion Models
Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, within the existing preference datasets, the original caption often does not clearly favor the preferred image over the alternative, which weakens the supervision signal available during training. To address this issue, we introduce Dual Caption Preference Optimization (DCPO), a data augmentation and optimization framework that reinforces the learning signal by assigning two distinct captions to each preference pair. This encourages the model to better differentiate between preferred and less-preferred outcomes during training. We also construct Pick-Double Caption, a modified version of Pick-a-Pic v2 with separate captions for each image, and propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, fine-tuned on SD 2.1 as the backbone.
♻ ☆ The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction
Uniform manifold approximation and projection (UMAP) is among the most popular neighbor embedding methods. The method relies on attractive and repulsive forces among high-dimensional data points to obtain a low-dimensional embedding. In this paper, we analyze the forces to reveal their effects on cluster formations and visualization and compare UMAP to its contemporaries. Repulsion emphasizes differences, controlling cluster boundaries and inter-cluster distance. Attraction is more subtle, as attractive tension between points can manifest simultaneously as attraction and repulsion in the lower-dimensional mapping. This explains the need for learning rate annealing and motivates the different treatments between attractive and repulsive terms. Moreover, by modifying attraction, we improve the consistency of cluster formation under random initialization. Overall, our analysis makes UMAP and similar embedding methods more interpretable, more robust, and more accurate.
comment: 9 page + appendix
♻ ☆ ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
comment: Accepted by CBMI 2025 (IEEE International Conference on Content-Based Multimedia Indexing)
♻ ☆ Hierarchical Feature Learning for Medical Point Clouds via State Space Model
Deep learning-based point cloud modeling has been widely investigated as an indispensable component of general shape analysis. Recently, transformer and state space model (SSM) have shown promising capacities in point cloud learning. However, limited research has been conducted on medical point clouds, which have great potential in disease diagnosis and treatment. This paper presents an SSM-based hierarchical feature learning framework for medical point cloud understanding. Specifically, we down-sample input into multiple levels through the farthest point sampling. At each level, we perform a series of k-nearest neighbor (KNN) queries to aggregate multi-scale structural information. To assist SSM in processing point clouds, we introduce coordinate-order and inside-out scanning strategies for efficient serialization of irregular points. Point features are calculated progressively from short neighbor sequences and long point sequences through vanilla and group Point SSM blocks, to capture both local patterns and long-range dependencies. To evaluate the proposed method, we build a large-scale medical point cloud dataset named MedPointS for anatomy classification, completion, and segmentation. Extensive experiments conducted on MedPointS demonstrate that our method achieves superior performance across all tasks. The dataset is available at https://flemme-docs.readthedocs.io/en/latest/medpoints.html. Code is merged to a public medical imaging platform: https://github.com/wlsdzyzl/flemme.
comment: 10 pages, 3 figures
♻ ☆ Real-time Spatial-temporal Traversability Assessment via Feature-based Sparse Gaussian Process
Terrain analysis is critical for the practical ap- plication of ground mobile robots in real-world tasks, espe- cially in outdoor unstructured environments. In this paper, we propose a novel spatial-temporal traversability assessment method, which aims to enable autonomous robots to effectively navigate through complex terrains. Our approach utilizes sparse Gaussian processes (SGP) to extract geometric features (curvature, gradient, elevation, etc.) directly from point cloud scans. These features are then used to construct a high- resolution local traversability map. Then, we design a spatial- temporal Bayesian Gaussian kernel (BGK) inference method to dynamically evaluate traversability scores, integrating historical and real-time data while considering factors such as slope, flatness, gradient, and uncertainty metrics. GPU acceleration is applied in the feature extraction step, and the system achieves real-time performance. Extensive simulation experiments across diverse terrain scenarios demonstrate that our method outper- forms SOTA approaches in both accuracy and computational efficiency. Additionally, we develop an autonomous navigation framework integrated with the traversability map and validate it with a differential driven vehicle in complex outdoor envi- ronments. Our code will be open-source for further research and development by the community, https://github.com/ZJU-FAST-Lab/FSGP_BGK.
comment: accepted by IROS2025
♻ ☆ I2I-Mamba: Multi-modal medical image synthesis via selective state space modeling
Multi-modal medical image synthesis involves nonlinear transformation of tissue signals between source and target modalities, where tissues exhibit contextual interactions across diverse spatial distances. As such, the utility of a network architecture in synthesis depends on its ability to express the broad set of contextual features in medical images. Convolutional neural networks (CNNs) offer high local precision at the expense of poor sensitivity to long-range context. While transformers promise to alleviate this issue, they suffer from an unfavorable trade-off between sensitivity to long- versus short-range context due to the intrinsic complexity of attention filters. To effectively capture contextual features while avoiding the complexitydriven trade-offs, here we introduce a novel multi-modal synthesis method, I2I-Mamba, based on the state space modeling (SSM) framework. Focusing on high-level representations across a hybrid residual architecture, I2I-Mamba leverages novel dual-domain Mamba (ddMamba) blocks for complementary contextual modeling in image and Fourier domains, while maintaining spatial precision with convolutional layers. Diverting from conventional raster-scan trajectories, ddMamba leverages novel SSM operators based on a spiral-scan trajectory to learn context with enhanced angular isotropy and radial coverage, and a channel-mixing layer to aggregate context across the channel dimension. Comprehensive demonstrations on multi-contrast MRI and MRI-CT protocols indicate that I2I-Mamba outperforms state-of-the-art CNNs, transformers and SSMs.
comment: 12 pages, 6 figures
♻ ☆ Improvement of Spiking Neural Network with Bit Planes and Color Models
Spiking neural network (SNN) has emerged as a promising paradigm in computational neuroscience and artificial intelligence, offering advantages such as low energy consumption and small memory footprint. However, their practical adoption is constrained by several challenges, prominently among them being performance optimization. In this study, we present a novel approach to enhance the performance of SNN for images through a new coding method that exploits bit plane representation. Our proposed technique is designed to improve the accuracy of SNN without increasing model size. Also, we investigate the impacts of color models of the proposed coding process. Through extensive experimental validation, we demonstrate the effectiveness of our coding strategy in achieving performance gain across multiple datasets. To the best of our knowledge, this is the first research that considers bit planes and color models in the context of SNN. By leveraging the unique characteristics of bit planes, we hope to unlock new potentials in SNNs performance, potentially paving the way for more efficient and effective SNNs models in future researches and applications.
comment: 2024 IEEE 16th International Conference on Computational Intelligence and Communication Networks (CICN)
♻ ☆ Predicting High-precision Depth on Low-Precision Devices Using 2D Hilbert Curves
Dense depth prediction deep neural networks (DNN) have achieved impressive results for both monocular and binocular data, but still they are limited by high computational complexity, restricting their use on low-end devices. For better on-device efficiency and hardware utilization, weights and activations of the DNN should be converted to low-bit precision. However, this precision is not sufficient to represent high dynamic range depth. In this paper, we aim to overcome this limitation and restore high-precision depth from low-bit precision predictions. To achieve this, we propose to represent high dynamic range depth as two low dynamic range components of a Hilbert curve, and to train the full-precision DNN to directly predict the latter. For on-device deployment, we use standard quantization methods and add a post-processing step that reconstructs depth from the Hilbert curve components predicted in low-bit precision. Extensive experiments demonstrate that our method increases the bit precision of predicted depth by up to three bits with little computational overhead. We also observed a positive side effect of quantization error reduction by up to 4.6 times. Our method enables effective and accurate depth prediction with DNN weights and activations quantized to eight-bit precision.
comment: 19 pages, 19 figures
♻ ☆ VGGSounder: Audio-Visual Evaluations for Foundation Models ICCV
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluation audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
comment: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2025
♻ ☆ Delta-Influence: Unlearning Poisons via Influence Functions NeurIPS
Addressing data integrity challenges, such as unlearning the effects of data poisoning after model training, is necessary for the reliable deployment of machine learning models. State-of-the-art influence functions, such as EK-FAC and TRAK, often fail to accurately attribute abnormal model behavior to the specific poisoned training data responsible for the data poisoning attack. In addition, traditional unlearning algorithms often struggle to effectively remove the influence of poisoned samples, particularly when only a few affected examples can be identified. To address these challenge, we introduce $\Delta$-Influence, a novel approach that leverages influence functions to trace abnormal model behavior back to the responsible poisoned training data using as little as just one poisoned test example. $\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points without significantly affecting clean data. This allows $\Delta$-Influence to detect large negative shifts in influence scores following data transformations, a phenomenon we term as influence collapse, thereby accurately identifying poisoned training data. Unlearning this subset, e.g. through retraining, effectively eliminates the data poisoning. We validate our method across three vision-based poisoning attacks and three datasets, benchmarking against five detection algorithms and five unlearning strategies. We show that $\Delta$-Influence consistently achieves the best unlearning across all settings, showing the promise of influence functions for corrective unlearning. Our code is publicly available at: https://github.com/Ruby-a07/delta-influence
comment: Accepted at NeurIPS Workshop on Attributing Model Behavior at Scale (ATTRIB @ NeurIPS 2024)
♻ ☆ RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics NeurIPS 2025
Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e,g., UR5, G1 humanoid) in cluttered real-world scenes. See the project page at https://zhoues.github.io/RoboRefer.
comment: Accepted by NeurIPS 2025. Project page: https://zhoues.github.io/RoboRefer/
♻ ☆ Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image
Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception information without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we incorporate geometry and perception priors to initialize the Gaussian branches and guide their parameter optimization. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we introduce a stable Score Distillation Sampling for fine-grained prior distillation to ensure effective knowledge transfer. The model is further enhanced by a reprojection-based strategy that enforces depth consistency. Experimental results show that we outperform existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.
comment: 10 pages, 5 figures
♻ ☆ Privacy-Preserving Visual Localization with Event Cameras
We consider the problem of client-server localization, where edge device users communicate visual data with the service provider for locating oneself against a pre-built 3D map. This localization paradigm is a crucial component for location-based services in AR/VR or mobile applications, as it is not trivial to store large-scale 3D maps and process fast localization on resource-limited edge devices. Nevertheless, conventional client-server localization systems possess numerous challenges in computational efficiency, robustness, and privacy-preservation during data transmission. Our work aims to jointly solve these challenges with a localization pipeline based on event cameras. By using event cameras, our system consumes low energy and maintains small memory bandwidth. Then during localization, we propose applying event-to-image conversion and leverage mature image-based localization, which achieves robustness even in low-light or fast-moving scenes. To further enhance privacy protection, we introduce privacy protection techniques at two levels. Network level protection aims to hide the entire user's view in private scenes using a novel split inference approach, while sensor level protection aims to hide sensitive user details such as faces with light-weight filtering. Both methods involve small client-side computation and localization performance loss, while significantly mitigating the feeling of insecurity as revealed in our user study. We thus project our method to serve as a building block for practical location-based services using event cameras. Project page including the code is available through this link: https://82magnolia.github.io/event\_localization/.
comment: Published in IEEE Transactions on Image Processing (TIP)
♻ ☆ JND-Guided Light-Weight Neural Pre-Filter for Perceptual Image Coding
Just Noticeable Distortion (JND)-guided pre-filter is a promising technique for improving the perceptual compression efficiency of image coding. However, existing methods are often computationally expensive, and the field lacks standardized benchmarks for fair comparison. To address these challenges, this paper introduces a twofold contribution. First, we develop and open-source FJNDF-Pytorch, a unified benchmark for frequency-domain JND-Guided pre-filters. Second, leveraging this platform, we propose a complete learning framework for a novel, lightweight Convolutional Neural Network (CNN). Experimental results demonstrate that our proposed method achieves state-of-the-art compression efficiency, consistently outperforming competitors across multiple datasets and encoders. In terms of computational cost, our model is exceptionally lightweight, requiring only 7.15 GFLOPs to process a 1080p image, which is merely 14.1% of the cost of recent lightweight network. Our work presents a robust, state-of-the-art solution that excels in both performance and efficiency, supported by a reproducible research platform. The open-source implementation is available at https://github.com/viplab-fudan/FJNDF-Pytorch.
comment: 5 pages, 4 figures
♻ ☆ Fine-Grained Classification: Connecting Metadata via Cross-Contrastive Pre-Training
Fine-grained visual classification aims to recognize objects belonging to many subordinate categories of a supercategory, where appearance alone often fails to distinguish highly similar classes. We propose a unified framework that integrates image, text, and metadata via cross-contrastive pre-training. We first align the three modality encoders in a shared embedding space and then fine-tune the image and metadata encoders for classification. On NABirds, our approach improves over the baseline by 7.83% and achieves 84.44% top-1 accuracy, outperforming strong multimodal methods.
comment: 5 pages, 4 figures. Accepted at IEEE ISCMI 2025
♻ ☆ EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images
For scene understanding in unstructured environments, an accurate and uncertainty-aware metric-semantic mapping is required to enable informed action selection by autonomous systems. Existing mapping methods often suffer from overconfident semantic predictions, and sparse and noisy depth sensing, leading to inconsistent map representations. In this paper, we therefore introduce EvidMTL, a multi-task learning framework that uses evidential heads for depth estimation and semantic segmentation, enabling uncertainty-aware inference from monocular RGB images. To enable uncertainty-calibrated evidential multi-task learning, we propose a novel evidential depth loss function that jointly optimizes the belief strength of the depth prediction in conjunction with evidential segmentation loss. Building on this, we present EvidKimera, an uncertainty-aware semantic surface mapping framework, which uses evidential depth and semantics prediction for improved 3D metric-semantic consistency. We train and evaluate EvidMTL on the NYUDepthV2 and assess its zero-shot performance on ScanNetV2, demonstrating superior uncertainty estimation compared to conventional approaches while maintaining comparable depth estimation and semantic segmentation. In zero-shot mapping tests on ScanNetV2, EvidKimera outperforms Kimera in semantic surface mapping accuracy and consistency, highlighting the benefits of uncertainty-aware mapping and underscoring its potential for real-world robotic applications.
comment: Submitted to IROS 2025 Conference
Computer Vision and Pattern Recognition 145
☆ OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
comment: Technical Report. Code: https://github.com/NVlabs/OmniVinci
☆ Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose \textbf{Skyfall-GS}, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: https://skyfall-gs.jayinnn.dev/
comment: Project page: https://skyfall-gs.jayinnn.dev/
LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal ICCV 2025
Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: https://ray-1026.github.io/lightsout/
comment: ICCV 2025. Project page: https://ray-1026.github.io/lightsout/
☆ BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models
The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model's performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.
comment: 10 Pages + 15 Supplementary Material Pages, 5 figures
☆ BLIP3o-NEXT: Next Frontier of Native Image Generation
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
☆ Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
☆ 3DPR: Single Image 3D Portrait Relight using Generative Priors SIGGRAPH
Rendering novel, relit views of a human head, given a monocular portrait image as input, is an inherently underconstrained problem. The traditional graphics solution is to explicitly decompose the input image into geometry, material and lighting via differentiable rendering; but this is constrained by the multiple assumptions and approximations of the underlying models and parameterizations of these scene components. We propose 3DPR, an image-based relighting model that leverages generative priors learnt from multi-view One-Light-at-A-Time (OLAT) images captured in a light stage. We introduce a new diverse and large-scale multi-view 4K OLAT dataset of 139 subjects to learn a high-quality prior over the distribution of high-frequency face reflectance. We leverage the latent space of a pre-trained generative head model that provides a rich prior over face geometry learnt from in-the-wild image datasets. The input portrait is first embedded in the latent manifold of such a model through an encoder-based inversion process. Then a novel triplane-based reflectance network trained on our lightstage data is used to synthesize high-fidelity OLAT images to enable image-based relighting. Our reflectance network operates in the latent space of the generative head model, crucially enabling a relatively small number of lightstage images to train the reflectance model. Combining the generated OLATs according to a given HDRI environment maps yields physically accurate environmental relighting results. Through quantitative and qualitative evaluations, we demonstrate that 3DPR outperforms previous methods, particularly in preserving identity and in capturing lighting effects such as specularities, self-shadows, and subsurface scattering. Project Page: https://vcai.mpi-inf.mpg.de/projects/3dpr/
comment: Accepted at ACM SIGGRAPH ASIA 2025 Conference Proceedings
☆ Paper2Web: Let's Make Your Paper Alive!
Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.
comment: Under Review. Check https://github.com/YuhangChen1/Paper2All for the unified platform to streamline all academic presentation
☆ Neuro-Symbolic Spatial Reasoning in Segmentation
Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., , and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
☆ VISTA: A Test-Time Self-Improving Video Generation Agent
Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
☆ ERNet: Efficient Non-Rigid Registration Network for Point Sequences ICCV 2025
Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we introduce to adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms previous state-of-the-art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than 4x speedup compared to the previous best, offering significant efficiency improvement.
comment: Accepted to ICCV 2025. Project Page: https://guangzhaohe.com/ernet
☆ ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection NeurIPS 2025
The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions within diffusion sampling process. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improve the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at https://github.com/haoweiz23/ReCon .
comment: Accepted to NeurIPS 2025 (spotlight)
☆ Controlling the image generation process with parametric activation functions
As image generative models continue to increase not only in their fidelity but also in their ubiquity the development of tools that leverage direct interaction with their internal mechanisms in an interpretable way has received little attention In this work we introduce a system that allows users to develop a better understanding of the model through interaction and experimentation By giving users the ability to replace activation functions of a generative network with parametric ones and a way to set the parameters of these functions we introduce an alternative approach to control the networks output We demonstrate the use of our method on StyleGAN2 and BigGAN networks trained on FFHQ and ImageNet respectively.
comment: 5 pages, 5 figures, accepted for the 16th International Conference on Computational Creativity, ICCC'25
☆ SANR: Scene-Aware Neural Representation for Light Field Image Compression with Rate-Distortion Optimization
Light field images capture multi-view scene information and play a crucial role in 3D scene reconstruction. However, their high-dimensional nature results in enormous data volumes, posing a significant challenge for efficient compression in practical storage and transmission scenarios. Although neural representation-based methods have shown promise in light field image compression, most approaches rely on direct coordinate-to-pixel mapping through implicit neural representation (INR), often neglecting the explicit modeling of scene structure. Moreover, they typically lack end-to-end rate-distortion optimization, limiting their compression efficiency. To address these limitations, we propose SANR, a Scene-Aware Neural Representation framework for light field image compression with end-to-end rate-distortion optimization. For scene awareness, SANR introduces a hierarchical scene modeling block that leverages multi-scale latent codes to capture intrinsic scene structures, thereby reducing the information gap between INR input coordinates and the target light field image. From a compression perspective, SANR is the first to incorporate entropy-constrained quantization-aware training (QAT) into neural representation-based light field image compression, enabling end-to-end rate-distortion optimization. Extensive experiment results demonstrate that SANR significantly outperforms state-of-the-art techniques regarding rate-distortion performance with a 65.62\% BD-rate saving against HEVC.
☆ Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model
Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restricts their practical value, directly damage the responsibility of strategy from concept-based methods. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, Experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.
☆ QSilk: Micrograin Stabilization and Adaptive Quantile Clipping for Detail-Friendly Latent Diffusion
We present QSilk, a lightweight, always-on stabilization layer for latent diffusion that improves high-frequency fidelity while suppressing rare activation spikes. QSilk combines (i) a per-sample micro clamp that gently limits extreme values without washing out texture, and (ii) Adaptive Quantile Clip (AQClip), which adapts the allowed value corridor per region. AQClip can operate in a proxy mode using local structure statistics or in an attention entropy guided mode (model confidence). Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It requires no training or fine-tuning and exposes minimal user controls. We report consistent qualitative improvements across SD/SDXL backbones and show synergy with CFG/Rescale, enabling slightly higher guidance without artifacts.
comment: Preprint. Qualitative side-by-side comparisons (fixed seeds); 3 figures with subfigures; 1 algorithm. CADE 2.5 / SDXL integration; sample images included. Code and presets planned for release upon publication
☆ Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity
Poultry farming faces increasing pressure to meet productivity targets while ensuring animal welfare and environmental compliance. Yet many small and medium-sized farms lack affordable, integrated tools for continuous monitoring and decision-making, relying instead on manual, reactive inspections. This paper presents Poultry Farm Intelligence (PoultryFI) - a modular, cost-effective platform that integrates six AI-powered modules: Camera Placement Optimizer, Audio-Visual Monitoring, Analytics & Alerting, Real-Time Egg Counting, Production & Profitability Forecasting, and a Recommendation Module. Camera layouts are first optimized offline using evolutionary algorithms for full poultry house coverage with minimal hardware. The Audio-Visual Monitoring module extracts welfare indicators from synchronized video, audio, and feeding data. Analytics & Alerting produces daily summaries and real-time notifications, while Real-Time Egg Counting uses an edge vision model to automate production tracking. Forecasting models predict egg yield and feed consumption up to 10 days in advance, and the Recommendation Module integrates forecasts with weather data to guide environmental and operational adjustments. This is among the first systems to combine low-cost sensing, edge analytics, and prescriptive AI to continuously monitor flocks, predict production, and optimize performance. Field trials demonstrate 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting. PoultryFI bridges the gap between isolated pilot tools and scalable, farm-wide intelligence, empowering producers to proactively safeguard welfare and profitability.
☆ Semantic segmentation with coarse annotations
Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves best results using annotated images, where each pixel is annotated with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an images leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel based upsampling. It encourages the segmented pixels in the decoded image to be SLIC-superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke data sets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.
☆ NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation
Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which could detect and mitigate implicit malicious intention in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that could identify malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that could optimize the initial noise by suppressing the prominent region's attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE, etc. Code and resources are available at https://github.com/lorraine021/NDM.
comment: 10 pages, 8 figures, accepted by ACMMM 2025
☆ SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior ICCV-2025
In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually deal with this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism leads to their failure rates significantly increasing when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution Paradigm for Content-Aware Layout Generation. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, another refining module performs fine-level reasoning regarding the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the model to enhance its layout planning ability. Besides, we present GenPoster-100K that is a new large-scale poster dataset with rich meta-information annotation. The experiments demonstrate the effectiveness of our approach by achieving the state-of-the-art results on multiple benchmark datasets. Our project page is at: https://brucew91.github.io/SEGA.github.io/
comment: Accepted by ICCV-2025, Our project website is at: https://brucew91.github.io/SEGA.github.io/, 10 pages
☆ Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
comment: Project page: https://ezioby.github.io/Ditto_page Code: https://github.com/EzioBy/Ditto
☆ Fix False Transparency by Noise Guided Splatting
Opaque objects reconstructed by 3DGS often exhibit a falsely transparent surface, leading to inconsistent background and internal patterns under camera motion in interactive viewing. This issue stems from the ill-posed optimization in 3DGS. During training, background and foreground Gaussians are blended via alpha-compositing and optimized solely against the input RGB images using a photometric loss. As this process lacks an explicit constraint on surface opacity, the optimization may incorrectly assign transparency to opaque regions, resulting in view-inconsistent and falsely transparent. This issue is difficult to detect in standard evaluation settings but becomes particularly evident in object-centric reconstructions under interactive viewing. Although other causes of view-inconsistency have been explored recently, false transparency has not been explicitly identified. To the best of our knowledge, we are the first to identify, characterize, and develop solutions for this artifact, an underreported artifact in 3DGS. Our strategy, NGS, encourages surface Gaussians to adopt higher opacity by injecting opaque noise Gaussians in the object volume during training, requiring only minimal modifications to the existing splatting process. To quantitatively evaluate false transparency in static renderings, we propose a transmittance-based metric that measures the severity of this artifact. In addition, we introduce a customized, high-quality object-centric scan dataset exhibiting pronounced transparency issues, and we augment popular existing datasets with complementary infill noise specifically designed to assess the robustness of 3D reconstruction methods to false transparency. Experiments across multiple datasets show that NGS substantially reduces false transparency while maintaining competitive performance on standard rendering metrics, demonstrating its overall effectiveness.
☆ DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification
Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
comment: 9 pages, accepted at ACMMM2025 SUMAC
☆ Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis
Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
☆ Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI
Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated in the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly Detection Rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
comment: 10 pages, 5 figures, BraTS GoAT 2025 challenge
☆ Valeo Near-Field: a novel dataset for pedestrian intent detection
This paper presents a novel dataset aimed at detecting pedestrians' intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.
☆ Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation
Automatic medical image segmentation is a fundamental step in computer-aided diagnosis, yet fully supervised approaches demand extensive pixel-level annotations that are costly and time-consuming. To alleviate this burden, we propose a weakly supervised segmentation framework that leverages only four extreme points as annotation. Specifically, bounding boxes derived from the extreme points are used as prompts for the Segment Anything Model 2 (SAM2) to generate reliable initial pseudo labels. These pseudo labels are progressively refined by an enhanced Feature-Guided Extreme Point Masking (FGEPM) algorithm, which incorporates Monte Carlo dropout-based uncertainty estimation to construct a unified gradient uncertainty cost map for boundary tracing. Furthermore, a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss are introduced to ensure spatial consistency and precise boundary alignment during training. Extensive experiments on two public ultrasound datasets, BUSI and UNS, demonstrate that our method achieves performance comparable to, and even surpassing fully supervised counterparts while significantly reducing annotation cost. These results validate the effectiveness and practicality of the proposed weakly supervised framework for ultrasound image segmentation.
☆ Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey
Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain a differently distributed target domain. It has broad applications across various real-world applications, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.
comment: 30 pages, 7 figures
Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration MICCAI 2025
Current self-supervised denoising techniques achieve impressive results, yet their real-world application is frequently constrained by substantial computational and memory demands, necessitating a compromise between inference speed and reconstruction quality. In this paper, we present an ultra-lightweight model that addresses this challenge, achieving both fast denoising and high quality image restoration. Built upon the Noise2Noise training framework-which removes the reliance on clean reference images or explicit noise modeling-we introduce an innovative multistage denoising pipeline named Noise2Detail (N2D). During inference, this approach disrupts the spatial correlations of noise patterns to produce intermediate smooth structures, which are subsequently refined to recapture fine details directly from the noisy input. Extensive testing reveals that Noise2Detail surpasses existing dataset-free techniques in performance, while requiring only a fraction of the computational resources. This combination of efficiency, low computational cost, and data-free approach make it a valuable tool for biomedical imaging, overcoming the challenges of scarce clean training data-due to rare and complex imaging modalities-while enabling fast inference for practical use.
comment: 10 pages, MICCAI 2025
Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection
Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: https://reality.tf.fau.de/pub/ardelean2025quantized.html
comment: 13 pages, 10 figures. Published in the 30th Intl. Conference on Vision, Modeling, and Visualization (VMV), 2025
☆ FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: rgb, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.
☆ Context-aware deep learning using individualized prior information reduces false positives in disease risk prediction and longitudinal health assessment
Temporal context in medicine is valuable in assessing key changes in patient health over time. We developed a machine learning framework to integrate diverse context from prior visits to improve health monitoring, especially when prior visits are limited and their frequency is variable. Our model first estimates initial risk of disease using medical data from the most recent patient visit, then refines this assessment using information digested from previously collected imaging and/or clinical biomarkers. We applied our framework to prostate cancer (PCa) risk prediction using data from a large population (28,342 patients, 39,013 magnetic resonance imaging scans, 68,931 blood tests) collected over nearly a decade. For predictions of the risk of clinically significant PCa at the time of the visit, integrating prior context directly converted false positives to true negatives, increasing overall specificity while preserving high sensitivity. False positive rates were reduced progressively from 51% to 33% when integrating information from up to three prior imaging examinations, as compared to using data from a single visit, and were further reduced to 24% when also including additional context from prior clinical data. For predicting the risk of PCa within five years of the visit, incorporating prior context reduced false positive rates still further (64% to 9%). Our findings show that information collected over time provides relevant context to enhance the specificity of medical risk prediction. For a wide range of progressive conditions, sufficient reduction of false positive rates using context could offer a pathway to expand longitudinal health monitoring programs to large populations with comparatively low baseline risk of disease, leading to earlier detection and improved health outcomes.
comment: 18 pages, 5 figures, 1 table
☆ Standardization for improved Spatio-Temporal Image Fusion
Spatio-Temporal Image Fusion (STIF) methods usually require sets of images with matching spatial and spectral resolutions captured by different sensors. To facilitate the application of STIF methods, we propose and compare two different standardization approaches. The first method is based on traditional upscaling of the fine-resolution images. The second method is a sharpening approach called Anomaly Based Satellite Image Standardization (ABSIS) that blends the overall features found in the fine-resolution image series with the distinctive attributes of a specific coarse-resolution image to produce images that more closely resemble the outcome of aggregating the fine-resolution images. Both methods produce a significant increase in accuracy of the Unpaired Spatio Temporal Fusion of Image Patches (USTFIP) STIF method, with the sharpening approach increasing the spectral and spatial accuracies of the fused images by up to 49.46\% and 78.40\%, respectively.
Lightweight CycleGAN Models for Cross-Modality Image Transformation and Experimental Quality Assessment in Fluorescence Microscopy
Lightweight deep learning models offer substantial reductions in computational cost and environmental impact, making them crucial for scientific applications. We present a lightweight CycleGAN for modality transfer in fluorescence microscopy (confocal to super-resolution STED/deconvolved STED), addressing the common challenge of unpaired datasets. By replacing the traditional channel-doubling strategy in the U-Net-based generator with a fixed channel approach, we drastically reduce trainable parameters from 41.8 million to approximately nine thousand, achieving superior performance with faster training and lower memory usage. We also introduce the GAN as a diagnostic tool for experimental and labeling quality. When trained on high-quality images, the GAN learns the characteristics of optimal imaging; deviations between its generated outputs and new experimental images can reveal issues such as photobleaching, artifacts, or inaccurate labeling. This establishes the model as a practical tool for validating experimental accuracy and image fidelity in microscopy workflows.
comment: 17 pages, 8 Figures
☆ Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images
DeepFake technology has advanced significantly in recent years, enabling the creation of highly realistic synthetic face images. Existing DeepFake detection methods often struggle with pose variations, occlusions, and artifacts that are difficult to detect in real-world conditions. To address these challenges, we propose a multi-view architecture that enhances DeepFake detection by analyzing facial features at multiple levels. Our approach integrates three specialized encoders, a global view encoder for detecting boundary inconsistencies, a middle view encoder for analyzing texture and color alignment, and a local view encoder for capturing distortions in expressive facial regions such as the eyes, nose, and mouth, where DeepFake artifacts frequently occur. Additionally, we incorporate a face orientation encoder, trained to classify face poses, ensuring robust detection across various viewing angles. By fusing features from these encoders, our model achieves superior performance in detecting manipulated images, even under challenging pose and lighting conditions.Experimental results on challenging datasets demonstrate the effectiveness of our method, outperforming conventional single-view approaches
☆ Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation
Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at https://github.com/HiHiAllen/Imaginarium.
☆ ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents ICDAR2025
This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.
comment: 18 pages, accepted at ICDAR2025 DALL
Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics
Positron emission tomography (PET) with 18F-Fluorodeoxyglucose (FDG) is an established tool in the diagnostic workup of patients with suspected dementing disorders. However, compared to the routinely available magnetic resonance imaging (MRI), FDG-PET remains significantly less accessible and substantially more expensive. Here, we present SiM2P, a 3D diffusion bridge-based framework that learns a probabilistic mapping from MRI and auxiliary patient information to simulate FDG-PET images of diagnostic quality. In a blinded clinical reader study, two neuroradiologists and two nuclear medicine physicians rated the original MRI and SiM2P-simulated PET images of patients with Alzheimer's disease, behavioral-variant frontotemporal dementia, and cognitively healthy controls. SiM2P significantly improved the overall diagnostic accuracy of differentiating between three groups from 75.0% to 84.7% (p<0.05). Notably, the simulated PET images received higher diagnostic certainty ratings and achieved superior interrater agreement compared to the MRI images. Finally, we developed a practical workflow for local deployment of the SiM2P framework. It requires as few as 20 site-specific cases and only basic demographic information. This approach makes the established diagnostic benefits of FDG-PET imaging more accessible to patients with suspected dementing disorders, potentially improving early detection and differential diagnosis in resource-limited settings. Our code is available at https://github.com/Yiiitong/SiM2P.
☆ An Empirical Study on MC Dropout--Based Uncertainty--Error Correlation in 2D Brain Tumor Segmentation
Accurate brain tumor segmentation from MRI is vital for diagnosis and treatment planning. Although Monte Carlo (MC) Dropout is widely used to estimate model uncertainty, its effectiveness in identifying segmentation errors -- especially near tumor boundaries -- remains unclear. This study empirically examines the relationship between MC Dropout--based uncertainty and segmentation error in 2D brain tumor MRI segmentation using a U-Net trained under four augmentation settings: none, horizontal flip, rotation, and scaling. Uncertainty was computed from 50 stochastic forward passes and correlated with pixel-wise errors using Pearson and Spearman coefficients. Results show weak global correlations ($r \approx 0.30$--$0.38$) and negligible boundary correlations ($|r| < 0.05$). Although differences across augmentations were statistically significant ($p < 0.001$), they lacked practical relevance. These findings suggest that MC Dropout uncertainty provides limited cues for boundary error localization, underscoring the need for alternative or hybrid uncertainty estimation methods in medical image segmentation.
comment: Code and results available at https://github.com/Saumya4321/mc-dropout-boundary-correlation
☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
☆ Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training
This work presents a systematic investigation of custom convolutional neural network architectures for satellite land use classification, achieving 97.23% test accuracy on the EuroSAT dataset without reliance on pre-trained models. Through three progressive architectural iterations (baseline: 94.30%, CBAM-enhanced: 95.98%, and balanced multi-task attention: 97.23%) we identify and address specific failure modes in satellite imagery classification. Our principal contribution is a novel balanced multi-task attention mechanism that combines Coordinate Attention for spatial feature extraction with Squeeze-Excitation blocks for spectral feature extraction, unified through a learnable fusion parameter. Experimental results demonstrate that this learnable parameter autonomously converges to alpha approximately 0.57, indicating near-equal importance of spatial and spectral modalities for satellite imagery. We employ progressive DropBlock regularization (5-20% by network depth) and class-balanced loss weighting to address overfitting and confusion pattern imbalance. The final 12-layer architecture achieves Cohen's Kappa of 0.9692 with all classes exceeding 94.46% accuracy, demonstrating confidence calibration with a 24.25% gap between correct and incorrect predictions. Our approach achieves performance within 1.34% of fine-tuned ResNet-50 (98.57%) while requiring no external data, validating the efficacy of systematic architectural design for domain-specific applications. Complete code, trained models, and evaluation scripts are publicly available.
comment: 7 pages, 2 figures, 2 tables. Code and trained models available at https://github.com/virAditya/satellite-image-classification-eurosat
☆ Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models
Modern face recognition models achieve high overall accuracy but continue to exhibit systematic biases that disproportionately affect certain subpopulations. Conventional bias evaluation frameworks rely on labeled attributes to form subpopulations, which are expensive to obtain and limited to predefined categories. We introduce Latent Feature Alignment (LFA), an attribute-label-free algorithm that uses latent directions to identify subpopulations. This yields two main benefits over standard clustering: (i) semantically coherent grouping, where faces sharing common attributes are grouped together more reliably than by proximity-based methods, and (ii) discovery of interpretable directions, which correspond to semantic attributes such as age, ethnicity, or attire. Across four state-of-the-art recognition models (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence, while uncovering interpretable latent directions aligned with demographic and contextual attributes. These results position LFA as a practical method for representation auditing of face recognition models, enabling practitioners to identify and interpret biased subpopulations without predefined attribute annotations.
☆ Exploring Conditions for Diffusion models in Robotic Control
While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions - a successful strategy in other vision domains - yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
comment: Project page: https://orca-rc.github.io/
☆ Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
Low-light RAW image enhancement remains a challenging task. Although numerous deep learning based approaches have been proposed, they still suffer from inherent limitations. A key challenge is how to simultaneously achieve strong enhancement quality and high efficiency. In this paper, we rethink the architecture for efficient low-light image signal processing (ISP) and introduce a Hierarchical Mixing Architecture (HiMA). HiMA leverages the complementary strengths of Transformer and Mamba modules to handle features at large and small scales, respectively, thereby improving efficiency while avoiding the ambiguities observed in prior two-stage frameworks. To further address uneven illumination with strong local variations, we propose Local Distribution Adjustment (LoDA), which adaptively aligns feature distributions across different local regions. In addition, to fully exploit the denoised outputs from the first stage, we design a Multi-prior Fusion (MPF) module that integrates spatial and frequency-domain priors for detail enhancement. Extensive experiments on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior performance with fewer parameters. Code will be released at https://github.com/Cynicarlos/HiMA.
☆ Iterative Motion Compensation for Canonical 3D Reconstruction from UAV Plant Images Captured in Windy Conditions
3D phenotyping of plants plays a crucial role for understanding plant growth, yield prediction, and disease control. We present a pipeline capable of generating high-quality 3D reconstructions of individual agricultural plants. To acquire data, a small commercially available UAV captures images of a selected plant. Apart from placing ArUco markers, the entire image acquisition process is fully autonomous, controlled by a self-developed Android application running on the drone's controller. The reconstruction task is particularly challenging due to environmental wind and downwash of the UAV. Our proposed pipeline supports the integration of arbitrary state-of-the-art 3D reconstruction methods. To mitigate errors caused by leaf motion during image capture, we use an iterative method that gradually adjusts the input images through deformation. Motion is estimated using optical flow between the original input images and intermediate 3D reconstructions rendered from the corresponding viewpoints. This alignment gradually reduces scene motion, resulting in a canonical representation. After a few iterations, our pipeline improves the reconstruction of state-of-the-art methods and enables the extraction of high-resolution 3D meshes. We will publicly release the source code of our reconstruction pipeline. Additionally, we provide a dataset consisting of multiple plants from various crops, captured across different points in time.
☆ A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition
Facial micro-expressions are brief, involuntary facial movements that reveal hidden emotions. Most Micro-Expression Recognition (MER) methods that rely on optical flow typically focus on the onset-to-apex phase, neglecting the apex-to-offset phase, which holds key temporal dynamics. This study introduces a Combined Optical Flow (COF), integrating both phases to enhance feature representation. COF provides a more comprehensive motion analysis, improving MER performance. Experimental results on CASMEII and SAMM datasets show that COF outperforms single optical flow-based methods, demonstrating its effectiveness in capturing micro-expression dynamics.
☆ MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.
☆ MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes
Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset.
comment: 8 pages, 11 figures
☆ Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation
Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, typically lasting less than half a second. Recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. Although deep learning has enabled significant advances in micro-expression recognition (MER), its effectiveness is limited by the scarcity of annotated ME datasets. This data limitation not only hinders generalization but also restricts the diversity of motion patterns captured during training. Existing MER studies predominantly rely on simple spatial augmentations (e.g., flipping, rotation) and overlook temporal augmentation strategies that can better exploit motion characteristics. To address this gap, this paper proposes a phase-aware temporal augmentation method based on dynamic image. Rather than encoding the entire expression as a single onset-to-offset dynamic image (DI), our approach decomposes each expression sequence into two motion phases: onset-to-apex and apex-to-offset. A separate DI is generated for each phase, forming a Dual-phase DI augmentation strategy. These phase-specific representations enrich motion diversity and introduce complementary temporal cues that are crucial for recognizing subtle facial transitions. Extensive experiments on CASME-II and SAMM datasets using six deep architectures, including CNNs, Vision Transformer, and the lightweight LEARNet, demonstrate consistent performance improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, which are crucial for addressing class imbalance in MER. When combined with spatial augmentations, our method achieves up to a 10\% relative improvement. The proposed augmentation is simple, model-agnostic, and effective in low-resource settings, offering a promising direction for robust and generalizable MER.
☆ DPTrack:Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking
Existing nighttime aerial trackers based on prompt learning rely solely on spatial localization supervision, which fails to provide fine-grained cues that point to target features and inevitably produces vague prompts. This limitation impairs the tracker's ability to accurately focus on the object features and results in trackers still performing poorly. To address this issue, we propose DPTrack, a prompt-based aerial tracker designed for nighttime scenarios by encoding the given object's attribute features into the directional kernel enriched with fine-grained cues to generate precise prompts. Specifically, drawing inspiration from visual bionics, DPTrack first hierarchically captures the object's topological structure, leveraging topological attributes to enrich the feature representation. Subsequently, an encoder condenses these topology-aware features into the directional kernel, which serves as the core guidance signal that explicitly encapsulates the object's fine-grained attribute cues. Finally, a kernel-guided prompt module built on channel-category correspondence attributes propagates the kernel across the features of the search region to pinpoint the positions of target features and convert them into precise prompts, integrating spatial gating for robust nighttime tracking. Extensive evaluations on established benchmarks demonstrate DPTrack's superior performance. Our code will be available at https://github.com/zzq-vipsl/DPTrack.
☆ MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention
Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatial temporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, including raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among various modalities and feature scales. In addition, a multi-view alignment loss is designed to ensure semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8\%, 96.5\%, and 92.8\% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.
☆ Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
comment: Preprint, Under review
☆ Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Despite the remarkable success of the end-to-end paradigm in deep learning, it often suffers from slow convergence and heavy reliance on large-scale datasets, which fundamentally limits its efficiency and applicability in data-scarce domains such as medical imaging. In this work, we introduce the Predictive-Corrective (PC) paradigm, a framework that decouples the modeling task to fundamentally accelerate learning. Building upon this paradigm, we propose a novel network, termed PCMambaNet. PCMambaNet is composed of two synergistic modules. First, the Predictive Prior Module (PPM) generates a coarse approximation at low computational cost, thereby anchoring the search space. Specifically, the PPM leverages anatomical knowledge-bilateral symmetry-to predict a 'focus map' of diagnostically relevant asymmetric regions. Next, the Corrective Residual Network (CRN) learns to model the residual error, focusing the network's full capacity on refining these challenging regions and delineating precise pathological boundaries. Extensive experiments on high-resolution brain MRI segmentation demonstrate that PCMambaNet achieves state-of-the-art accuracy while converging within only 1-5 epochs-a performance unattainable by conventional end-to-end models. This dramatic acceleration highlights that by explicitly incorporating domain knowledge to simplify the learning objective, PCMambaNet effectively mitigates data inefficiency and overfitting.
☆ Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety
Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.
comment: 11 pages, 10 figures, The 8th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery (GeoAI '25), November 3--6, 2025, Minneapolis, MN, USA
☆ Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
☆ Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning
Clinical adoption of multi-shot diffusion-weighted magnetic resonance imaging (multi-shot DWI) for body-wide tumor diagnostics is limited by severe motion-induced phase artifacts from respiration, peristalsis, and so on, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities. Here, we introduce a reconstruction framework, LoSP-Prompt, that overcomes these challenges through physics-informed modeling and synthetic-data-driven prompt learning. We model inter-shot phase variations as a high-order Locally Smooth Phase (LoSP), integrated into a low-rank Hankel matrix reconstruction. Crucially, the algorithm's rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion. Validated across 10,000+ clinical images (43 subjects, 4 scanner models, 5 centers), LoSP-Prompt: (1) Achieved twice the spatial resolution of clinical single-shot DWI, enhancing liver lesion conspicuity; (2) Generalized to seven diverse anatomical regions (liver, kidney, sacroiliac, pelvis, knee, spinal cord, brain) with a single model; (3) Outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists' evaluations on a 5-point scale, $p<0.05$), achieving 4-5 points (excellent) on kidney DWI, 4 points (good to excellent) on liver, sacroiliac and spinal cord DWI, and 3-4 points (good) on knee and tumor brain. The approach eliminates navigator signals and realistic data supervision, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI. Its scanner-agnostic performance signifies transformative potential for precision oncology.
comment: 43 pages, 27 figures
☆ MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Most existing underwater instance segmentation approaches are constrained by close-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce \textbf{MARIS} (\underline{Mar}ine Open-Vocabulary \underline{I}nstance \underline{S}egmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by lack underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (\textbf{GPEM}) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (\textbf{SAIM}) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines both In-Domain and Cross-Domain setting on MARIS, establishing a strong foundation for future underwater perception research.
☆ LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/
☆ PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction
Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating them more intelligently into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.
☆ FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers ICCV2025
Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory such as depth discontinuity of object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanism. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
comment: Accepted to ICCV2025
☆ Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning
Minimally invasive surgery can benefit significantly from automated surgical tool detection, enabling advanced analysis and assistance. However, the limited availability of annotated data in surgical settings poses a challenge for training robust deep learning models. This paper introduces a novel staged adaptive fine-tuning approach consisting of two steps: a linear probing stage to condition additional classification layers on a pre-trained CNN-based architecture and a gradual freezing stage to dynamically reduce the fine-tunable layers, aiming to regulate adaptation to the surgical domain. This strategy reduces network complexity and improves efficiency, requiring only a single training loop and eliminating the need for multiple iterations. We validated our method on the Cholec80 dataset, employing CNN architectures (ResNet-50 and DenseNet-121) pre-trained on ImageNet for detecting surgical tools in cholecystectomy endoscopic videos. Our results demonstrate that our method improves detection performance compared to existing approaches and established fine-tuning techniques, achieving a mean average precision (mAP) of 96.4%. To assess its broader applicability, the generalizability of the fine-tuning strategy was further confirmed on the CATARACTS dataset, a distinct domain of minimally invasive ophthalmic surgery. These findings suggest that gradual freezing fine-tuning is a promising technique for improving tool presence detection in diverse surgical procedures and may have broader applications in general image classification tasks.
☆ Cortical-SSM: A Deep State Space Model for EEG and ECoG Motor Imagery Decoding
Classification of electroencephalogram (EEG) and electrocorticogram (ECoG) signals obtained during motor imagery (MI) has substantial application potential, including for communication assistance and rehabilitation support for patients with motor impairments. These signals remain inherently susceptible to physiological artifacts (e.g., eye blinking, swallowing), which pose persistent challenges. Although Transformer-based approaches for classifying EEG and ECoG signals have been widely adopted, they often struggle to capture fine-grained dependencies within them. To overcome these limitations, we propose Cortical-SSM, a novel architecture that extends deep state space models to capture integrated dependencies of EEG and ECoG signals across temporal, spatial, and frequency domains. We validated our method across three benchmarks: 1) two large-scale public MI EEG datasets containing more than 50 subjects, and 2) a clinical MI ECoG dataset recorded from a patient with amyotrophic lateral sclerosis. Our method outperformed baseline methods on the three benchmarks. Furthermore, visual explanations derived from our model indicate that it effectively captures neurophysiologically relevant regions of both EEG and ECoG signals.
☆ RankSEG-RMA: An Efficient Segmentation Algorithm via Reciprocal Moment Approximation
Semantic segmentation labels each pixel in an image with its corresponding class, and is typically evaluated using the Intersection over Union (IoU) and Dice metrics to quantify the overlap between predicted and ground-truth segmentation masks. In the literature, most existing methods estimate pixel-wise class probabilities, then apply argmax or thresholding to obtain the final prediction. These methods have been shown to generally lead to inconsistent or suboptimal results, as they do not directly maximize segmentation metrics. To address this issue, a novel consistent segmentation framework, RankSEG, has been proposed, which includes RankDice and RankIoU specifically designed to optimize the Dice and IoU metrics, respectively. Although RankSEG almost guarantees improved performance, it suffers from two major drawbacks. First, it is its computational expense-RankDice has a complexity of O(d log d) with a substantial constant factor (where d represents the number of pixels), while RankIoU exhibits even higher complexity O(d^2), thus limiting its practical application. For instance, in LiTS, prediction with RankSEG takes 16.33 seconds compared to just 0.01 seconds with the argmax rule. Second, RankSEG is only applicable to overlapping segmentation settings, where multiple classes can occupy the same pixel, which contrasts with standard benchmarks that typically assume non-overlapping segmentation. In this paper, we overcome these two drawbacks via a reciprocal moment approximation (RMA) of RankSEG with the following contributions: (i) we improve RankSEG using RMA, namely RankSEG-RMA, reduces the complexity of both algorithms to O(d) while maintaining comparable performance; (ii) inspired by RMA, we develop a pixel-wise score function that allows efficient implementation for non-overlapping segmentation settings.
☆ Confidence-Weighted Semi-Supervised Learning for Skin Lesion Segmentation Using Hybrid CNN-Transformer Networks
Automated skin lesion segmentation through dermoscopic analysis is essential for early skin cancer detection, yet remains challenging due to limited annotated training data. We present MIRA-U, a semi-supervised framework that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture. Our approach employs a teacher network pre-trained via masked image modeling to generate confidence-weighted soft pseudo-labels, which guide a U-shaped CNN-Transformer student network featuring cross-attention skip connections. This design enhances pseudo-label quality and boundary delineation, surpassing reconstruction-based and CNN-only baselines, particularly in low-annotation regimes. Extensive evaluation on ISIC-2016 and PH2 datasets demonstrates superior performance, achieving a Dice Similarity Coefficient (DSC) of 0.9153 and Intersection over Union (IoU) of 0.8552 using only 50% labeled data. Code is publicly available on GitHub.
☆ SHARE: Scene-Human Aligned Reconstruction SIGGRAPH
Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry's inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human's positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
comment: SIGGRAPH Asia Technical Communications 2025
☆ Proto-Former: Unified Facial Landmark Detection by Prototype Transformer
Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can only be trained on a single dataset. This limits the model generalization to different datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototype). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model's attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function effectively resolves the problem of prototype expert addressing instability during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that our Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: https://github.com/Husk021118/Proto-Former.
comment: This paper has been accepted by TMM October 2025. Project page:https://github.com/Husk021118/Proto-Former
☆ Neural Posterior Estimation for Cataloging Astronomical Images from the Legacy Survey of Space and Time
The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will commence full-scale operations in 2026, yielding an unprecedented volume of astronomical images. Constructing an astronomical catalog, a table of imaged stars, galaxies, and their properties, is a fundamental step in most scientific workflows based on astronomical image data. Traditional deterministic cataloging methods lack statistical coherence as cataloging is an ill-posed problem, while existing probabilistic approaches suffer from computational inefficiency, inaccuracy, or the inability to perform inference with multiband coadded images, the primary output format for LSST images. In this article, we explore a recently developed Bayesian inference method called neural posterior estimation (NPE) as an approach to cataloging. NPE leverages deep learning to achieve both computational efficiency and high accuracy. When evaluated on the DC2 Simulated Sky Survey -- a highly realistic synthetic dataset designed to mimic LSST data -- NPE systematically outperforms the standard LSST pipeline in light source detection, flux measurement, star/galaxy classification, and galaxy shape measurement. Additionally, NPE provides well-calibrated posterior approximations. These promising results, obtained using simulated data, illustrate the potential of NPE in the absence of model misspecification. Although some degree of model misspecification is inevitable in the application of NPE to real LSST images, there are a variety of strategies to mitigate its effects.
☆ Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
☆ Latent Diffusion Model without Variational Autoencoder
Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.
☆ Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning ICDM
Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. While existing methods implicitly model label relationships through distance-based similarity, lacking explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.
comment: 8 pages, ICDM Workshop
☆ QCFace: Image Quality Control for boosting Face Representation & Recognition WACV 2026
Recognizability, a key perceptual factor in human face processing, strongly affects the performance of face recognition (FR) systems in both verification and identification tasks. Effectively using recognizability to enhance feature representation remains challenging. In deep FR, the loss function plays a crucial role in shaping how features are embedded. However, current methods have two main drawbacks: (i) recognizability is only partially captured through soft margin constraints, resulting in weaker quality representation and lower discrimination, especially for low-quality or ambiguous faces; (ii) mutual overlapping gradients between feature direction and magnitude introduce undesirable interactions during optimization, causing instability and confusion in hypersphere planning, which may result in poor generalization, and entangled representations where recognizability and identity are not cleanly separated. To address these issues, we introduce a hard margin strategy - Quality Control Face (QCFace), which overcomes the mutual overlapping gradient problem and enables the clear decoupling of recognizability from identity representation. Based on this strategy, a novel hard-margin-based loss function employs a guidance factor for hypersphere planning, simultaneously optimizing for recognition ability and explicit recognizability representation. Extensive experiments confirm that QCFace not only provides robust and quantifiable recognizability encoding but also achieves state-of-the-art performance in both verification and identification benchmarks compared to existing recognizability-based losses.
comment: 21 pages with 11 figures, 14 tables and 71 references. Accepted in Round 1 at WACV 2026, Oral
☆ Post-Processing Methods for Improving Accuracy in MRI Inpainting
Magnetic Resonance Imaging (MRI) is the primary imaging modality used in the diagnosis, assessment, and treatment planning for brain pathologies. However, most automated MRI analysis tools, such as segmentation and registration pipelines, are optimized for healthy anatomies and often fail when confronted with large lesions such as tumors. To overcome this, image inpainting techniques aim to locally synthesize healthy brain tissues in tumor regions, enabling the reliable application of general-purpose tools. In this work, we systematically evaluate state-of-the-art inpainting models and observe a saturation in their standalone performance. In response, we introduce a methodology combining model ensembling with efficient post-processing strategies such as median filtering, histogram matching, and pixel averaging. Further anatomical refinement is achieved via a lightweight U-Net enhancement stage. Comprehensive evaluation demonstrates that our proposed pipeline improves the anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models. By combining established models with targeted post-processing, we achieve improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable, resource-conscious research. Our 2025 BraTS inpainting docker is available at https://hub.docker.com/layers/aparida12/brats2025/inpt.
☆ CuSfM: CUDA-Accelerated Structure-from-Motion
Efficient and accurate camera pose estimation forms the foundational requirement for dense reconstruction in autonomous navigation, robotic perception, and virtual simulation systems. This paper addresses the challenge via cuSfM, a CUDA-accelerated offline Structure-from-Motion system that leverages GPU parallelization to efficiently employ computationally intensive yet highly accurate feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping. The system supports pose optimization, mapping, prior-map localization, and extrinsic refinement. It is designed for offline processing, where computational resources can be fully utilized to maximize accuracy. Experimental results demonstrate that cuSfM achieves significantly improved accuracy and processing speed compared to the widely used COLMAP method across various testing scenarios, while maintaining the high precision and global consistency essential for offline SfM applications. The system is released as an open-source Python wrapper implementation, PyCuSfM, available at https://github.com/nvidia-isaac/pyCuSFM, to facilitate research and applications in computer vision and robotics.
☆ DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion NeurIPS
We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to $424\times800$ at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.
comment: Accepted by NeurIPS Workshop on Next Practices in Video Generation and Evaluation (Short Paper Track)
☆ Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.
☆ The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads
Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at https://github.com/aysanaghazadeh/FaceOfPersuasion
☆ CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records ICCV 2025
Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research on this unexplored field. Our dataset and code are available at https://github.com/BCVUniandes/Cardium, and at the project website https://bcv-uniandes.github.io/CardiumPage/
comment: Accepted to CVAMD Workshop, ICCV 2025
☆ Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. hile Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren't universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model's OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
♻ ☆ From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance
Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and untrustworthy. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking precision, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a universal foundation for vision-language-action humanoid systems.
♻ ☆ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR .
comment: Github Repo: https://github.com/PaddlePaddle/PaddleOCR
♻ ☆ FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models
Lane segment topology reasoning provides comprehensive bird's-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% v.s. 33.6% on mAP) and centerline perception (46.3% v.s. 41.5% on OLS).
♻ ☆ TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving
Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.
♻ ☆ DCMIL: A Progressive Representation Learning of Whole Slide Images for Cancer Prognosis Analysis
The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic modes for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All codes have been made publicly accessible at https://github.com/tuuuc/DCMIL.
♻ ☆ Vision-Centric Activation and Coordination for Multimodal Large Language Models
Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To track this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
♻ ☆ Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
comment: Our project and code are available at https://ali-videoai.github.io/identity_page
♻ ☆ Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
CoDS: Enhancing Collaborative Perception in Heterogeneous Scenarios via Domain Separation
Collaborative perception has been proven to improve individual perception in autonomous driving through multi-agent interaction. Nevertheless, most methods often assume identical encoders for all agents, which does not hold true when these models are deployed in real-world applications. To realize collaborative perception in actual heterogeneous scenarios, existing methods usually align neighbor features to those of the ego vehicle, which is vulnerable to noise from domain gaps and thus fails to address feature discrepancies effectively. Moreover, they adopt transformer-based modules for domain adaptation, which causes the model inference inefficiency on mobile devices. To tackle these issues, we propose CoDS, a Collaborative perception method that leverages Domain Separation to address feature discrepancies in heterogeneous scenarios. The CoDS employs two feature alignment modules, i.e., Lightweight Spatial-Channel Resizer (LSCR) and Distribution Alignment via Domain Separation (DADS). Besides, it utilizes the Domain Alignment Mutual Information (DAMI) loss to ensure effective feature alignment. Specifically, the LSCR aligns the neighbor feature across spatial and channel dimensions using a lightweight convolutional layer. Subsequently, the DADS mitigates feature distribution discrepancy with encoder-specific and encoder-agnostic domain separation modules. The former removes domain-dependent information and the latter captures task-related information. During training, the DAMI loss maximizes the mutual information between aligned heterogeneous features to enhance the domain separation process. The CoDS employs a fully convolutional architecture, which ensures high inference efficiency. Extensive experiments demonstrate that the CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade-off between detection accuracy and inference efficiency.
comment: Accepted by IEEE Transactions on Mobile Computing
♻ ☆ CLOVER: Context-aware Long-term Object Viewpoint- and Environment- Invariant Representation Learning
Mobile service robots can benefit from object-level understanding of their environments, including the ability to distinguish object instances and re-identify previously seen instances. Object re-identification is challenging across different viewpoints and in scenes with significant appearance variation arising from weather or lighting changes. Existing works on object re-identification either focus on specific classes or require foreground segmentation. Further, these methods, along with object re-identification datasets, have limited consideration of challenges such as outdoor scenes and illumination changes. To address this problem, we introduce CODa Re-ID: an in-the-wild object re-identification dataset containing 1,037,814 observations of 557 objects across 8 classes under diverse lighting conditions and viewpoints. Further, we propose CLOVER, a representation learning method for object observations that can distinguish between static object instances without requiring foreground segmentation. We also introduce MapCLOVER, a method for scalably summarizing CLOVER descriptors for use in object maps and matching new observations to summarized descriptors. Our results show that CLOVER achieves superior performance in static object re-identification under varying lighting conditions and viewpoint changes and can generalize to unseen instances and classes.
comment: 8 pages, 3 figures, 8 tables
♻ ☆ Refer to Any Segmentation Mask Group With Vision-Language Prompts ICCV 2025
Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.
comment: ICCV 2025
♻ ☆ CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models
We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.
comment: 8 pages, 3 figures. Endorsed by Dr. Seyedmorteza Sadat (ETH Zurich). The work introduces CADE 2.5 with ZeResFDG as a practical inference-time guidance stack for SD/SDXL. Code and visual examples to be released on GitHub and Hugging Face
♻ ☆ CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding
Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.
comment: Preprint, 27 pages, 3 figures
♻ ☆ KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection
Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) Existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly at \href{https://github.com/latuanvinh1998/KGAlign}{github.com/latuanvinh1998/KGAlign}.
comment: Withdrawn by the authors due to lack of explicit agreement from all co-authors to post this version publicly on arXiv
♻ ☆ A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson's r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson's r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
comment: 25 pages, 3 figures, 8 tables
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t\!\to\!0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.
♻ ☆ CoUn: Empowering Machine Unlearning via Contrastive Learning
Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.
♻ ☆ Methods and Trends in Detecting AI-Generated Images: A Comprehensive Review
The proliferation of generative models, such as Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methodologies to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have predominantly focused on deepfake detection and often overlook recent advancements in synthetic image forensics, particularly approaches that incorporate multimodal frameworks, reasoning-based detection, and training-free methodologies. To bridge this gap, this survey provides a comprehensive and up-to-date review of state-of-the-art techniques for detecting and classifying synthetic images generated by advanced generative AI models. The review systematically examines core detection paradigms, categorizes them into spatial-domain, frequency-domain, fingerprint-based, patch-based, training-free, and multimodal reasoning-based frameworks, and offers concise descriptions of their underlying principles. We further provide detailed comparative analyses of these methods on publicly available datasets to assess their generalizability, robustness, and interpretability. Finally, the survey highlights open challenges and future directions, emphasizing the potential of hybrid frameworks that combine the efficiency of training-free approaches with the semantic reasoning of multimodal models to advance trustworthy and explainable synthetic image forensics.
comment: 34 pages, 4 Figures, 10 Tables
♻ ☆ Low-Frequency First: Eliminating Floating Artifacts in 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) is a powerful and computationally efficient representation for 3D reconstruction. Despite its strengths, 3DGS often produces floating artifacts, which are erroneous structures detached from the actual geometry and significantly degrade visual fidelity. The underlying mechanisms causing these artifacts, particularly in low-quality initialization scenarios, have not been fully explored. In this paper, we investigate the origins of floating artifacts from a frequency-domain perspective and identify under-optimized Gaussians as the primary source. Based on our analysis, we propose \textit{Eliminating-Floating-Artifacts} Gaussian Splatting (EFA-GS), which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning. Additionally, we introduce complementary depth-based and scale-based strategies to dynamically refine Gaussian expansion, effectively mitigating detail erosion. Extensive experiments on both synthetic and real-world datasets demonstrate that EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving an improvement of 1.68 dB in PSNR over baseline method on our RWLQ dataset. Furthermore, we validate the effectiveness of our approach in downstream 3D editing tasks. Project Website: https://jcwang-gh.github.io/EFA-GS
comment: Our paper has been accepted by the 24th International Conference on Cyberworlds and recieved the Best Paper Honorable Mention
♻ ☆ Foundation Model-Driven Classification of Atypical Mitotic Figures with Domain-Aware Training Strategies
We present a solution for the MIDOG 2025 Challenge Track~2, addressing binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs). The approach leverages pathology-specific foundation model H-optimus-0, selected based on recent cross-domain generalization benchmarks and our empirical testing, with Low-Rank Adaptation (LoRA) fine-tuning and MixUp augmentation. Implementation includes soft labels based on multi-expert consensus, hard negative mining, and adaptive focal loss, metric learning and domain adaptation. The method demonstrates both the promise and challenges of applying foundation models to this complex classification task, achieving reasonable performance in the preliminary evaluation phase.
♻ ☆ PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification
The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and underexploited spectral complementarity. Existing approaches often fail to decouple shared structural features from modality-complementary radiometric attributes, resulting in feature conflicts and information loss. To address this, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-complementary) components in the Fourier domain. This design reinforces shared structures while preserving complementary characteristics, thereby enhancing fusion quality. Unlike previous methods that overlook the distinct physical properties encoded in frequency spectra, PAD explicitly introduces amplitude-phase decoupling for multi-modal fusion. Specifically, PAD comprises two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features via convolution-guided scaling to improve geometric consistency; and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high- and low-frequency patterns using frequency-adaptive multilayer perceptrons, effectively exploiting SAR's morphological sensitivity and RGB's spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK demonstrate state-of-the-art performance. This work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at https://github.com/RanFeng2/PAD.
comment: 14 pages, 8 figures
♻ ☆ Skull-stripping induces shortcut learning in MRI-based Alzheimer's disease classification
Objectives: High classification accuracy of Alzheimer's disease (AD) from structural MRI has been achieved using deep neural networks, yet the specific image features contributing to these decisions remain unclear. In this study, the contributions of T1-weighted (T1w) gray-white matter texture, volumetric information, and preprocessing -- particularly skull-stripping -- were systematically assessed. Methods: A dataset of 990 matched T1w MRIs from AD patients and cognitively normal controls from the ADNI database were used. Preprocessing was varied through skull-stripping and intensity binarization to isolate texture and shape contributions. A 3D convolutional neural network was trained on each configuration, and classification performance was compared using exact McNemar tests with discrete Bonferroni-Holm correction. Feature relevance was analyzed using Layer-wise Relevance Propagation, image similarity metrics, and spectral clustering of relevance maps. Results: Despite substantial differences in image content, classification accuracy, sensitivity, and specificity remained stable across preprocessing conditions. Models trained on binarized images preserved performance, indicating minimal reliance on gray-white matter texture. Instead, volumetric features -- particularly brain contours introduced through skull-stripping -- were consistently used by the models. Conclusions: This behavior reflects a shortcut learning phenomenon, where preprocessing artifacts act as potentially unintended cues. The resulting Clever Hans effect emphasizes the critical importance of interpretability tools to reveal hidden biases and to ensure robust and trustworthy deep learning in medical imaging.
♻ ☆ Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model's visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
♻ ☆ Unleashing the Potential of Pre-Trained Diffusion Models for Generalizable Person Re-Identification
Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at https://github.com/RikoLi/DCAC.
♻ ☆ Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
comment: https://genie-envisioner.github.io/
♻ ☆ NFIG: Autoregressive Image Generation with Next-Frequency Prediction
Autoregressive models have achieved promising results in natural language processing. However, for image generation tasks, they encounter substantial challenges in effectively capturing long-range dependencies, managing computational costs, and most crucially, defining meaningful autoregressive sequences that reflect natural image hierarchies. To address these issues, we present \textbf{N}ext-\textbf{F}requency \textbf{I}mage \textbf{G}eneration (\textbf{NFIG}), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images. This principled autoregressive sequence not only improves the quality of generated images by better capturing true causal relationships between image components, but also significantly reduces computational overhead during inference. Extensive experiments demonstrate that NFIG achieves state-of-the-art performance with fewer steps, offering a more efficient solution for image generation, with 1.25$\times$ speedup compared to VAR-d20 while achieving better performance (FID: 2.81) on the ImageNet-256 benchmark. We hope that our insight of incorporating frequency-domain knowledge to guide autoregressive sequence design will shed light on future research. We will make our code publicly available upon acceptance of the paper.
comment: 10 pages, 7 figures, 2 tables
♻ ☆ Conformal Risk Control for Pulmonary Nodule Detection
Quantitative tools are increasingly appealing for decision support in healthcare, driven by the growing capabilities of advanced AI systems. However, understanding the predictive uncertainties surrounding a tool's output is crucial for decision-makers to ensure reliable and transparent decisions. In this paper, we present a case study on pulmonary nodule detection for lung cancer screening, enhancing an advanced detection model with an uncertainty quantification technique called conformal risk control (CRC). We demonstrate that prediction sets with conformal guarantees are attractive measures of predictive uncertainty in the safety-critical healthcare domain, allowing end-users to achieve arbitrary validity by trading off false positives and providing formal statistical guarantees on model performance. Among ground-truth nodules annotated by at least three radiologists, our model achieves a sensitivity that is competitive with that generally achieved by individual radiologists, with a slight increase in false positives. Furthermore, we illustrate the risks of using off-the-shelve prediction models when faced with ontological uncertainty, such as when radiologists disagree on what constitutes the ground truth on pulmonary nodules.
♻ ☆ Leveraging Learning Bias for Noisy Anomaly Detection
This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework's contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
♻ ☆ Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.
♻ ☆ FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
♻ ☆ Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
comment: The code and model are available at https://github.com/M-E-AGI-Lab/Muddit
♻ ☆ First-order State Space Model for Lightweight Image Super-resolution
State space models (SSMs), particularly Mamba, have shown promise in NLP tasks and are increasingly applied to vision tasks. However, most Mamba-based vision models focus on network architecture and scan paths, with little attention to the SSM module. In order to explore the potential of SSMs, we modified the calculation process of SSM without increasing the number of parameters to improve the performance on lightweight super-resolution tasks. In this paper, we introduce the First-order State Space Model (FSSM) to improve the original Mamba module, enhancing performance by incorporating token correlations. We apply a first-order hold condition in SSMs, derive the new discretized form, and analyzed cumulative error. Extensive experimental results demonstrate that FSSM improves the performance of MambaIR on five benchmark datasets without additionally increasing the number of parameters, and surpasses current lightweight SR methods, achieving state-of-the-art results.
comment: Accept by ICASSP 2025 (Oral)
♻ ☆ HumorDB: Can AI understand graphical humor?
Despite significant advancements in image segmentation and object detection, understanding complex scenes remains a significant challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces \textbf{HumorDB}, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs where subtle edits differentiate between humorous and non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, suggesting that an effective understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features and bridging the gap between visual perception and abstract reasoning. All the code and data are available here: \href{https://github.com/kreimanlab/HumorDB}{https://github.com/kreimanlab/HumorDB}
comment: 10 main figures, 4 additional appendix figures
♻ ☆ Diffusion Models are Efficient Data Generators for Human Mesh Recovery
Despite remarkable progress having been made on the problem of 3D human pose and shape estimation (HPS), current state-of-the-art methods rely heavily on either confined indoor mocap datasets or datasets generated by a rendering engine using computer graphics (CG). Both categories of datasets exhibit inadequacies in furnishing adequate human identities and authentic in-the-wild background scenes, which are crucial for accurately simulating real-world distributions. In this work, we show that synthetic data created by generative models is complementary to CG-rendered data for achieving remarkable generalization performance on diverse real-world scenes. We propose an effective data generation pipeline based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. Specifically, we first collect a large-scale human-centric dataset with comprehensive annotations, e.g, text captions, the depth map, and surface normal images. To generate a wide variety of human images with initial labels, we train a customized, multi-condition ControlNet model. The key to this process is using a 3D parametric model, e.g, SMPL-X, to create various condition inputs easily. Our data generation pipeline is both flexible and customizable, making it adaptable to multiple real-world tasks, such as human interaction in complex scenes and humans captured by wide-angle lenses. By relying solely on generative models, we can produce large-scale, in-the-wild human images with high-quality annotations, significantly reducing the need for manual image collection and annotation. The generated dataset encompasses a wide range of viewpoints, environments, and human identities, ensuring its versatility across different scenarios. We hope that our work could pave the way for scaling up 3D human recovery to in-the-wild scenes.
comment: Accepted by TPAMI, project page: https://yongtaoge.github.io/projects/humanwild
♻ ☆ UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection CVPR2025
Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing the global dependencies from point cloud spaces, which serialize the 3D voxels into the flattened 1D sequence for iterative self-attention. However, the spatial structure of 3D voxels will be inevitably destroyed during the serialization process. Besides, due to the considerable number of 3D voxels and quadratic complexity of Transformers, multiple sequences are grouped before feeding to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSM) achieved in the field of 2D vision tasks, in this paper, we propose a novel Unified Mamba (UniMamba), which seamlessly integrates the merits of 3D convolution and SSM in a concise multi-head manner, aiming to perform "local and global" spatial context aggregation efficiently and simultaneously. Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, complementary Z-order serialization and local-global sequential aggregator. The spatial locality modeling module integrates 3D submanifold convolution to capture the dynamic spatial position embedding before serialization. Then the efficient Z-order curve is adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts the channel grouping strategy to efficiently encode both "local and global" spatial inter-dependencies using multi-head SSM. Additionally, an encoder-decoder architecture with stacked UniMamba blocks is formed to facilitate multi-scale spatial learning hierarchically. Extensive experiments are conducted on three popular datasets: nuScenes, Waymo and Argoverse 2. Particularly, our UniMamba achieves 70.2 mAP on the nuScenes dataset.
comment: Accepted to CVPR2025
♻ ☆ Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code will be released at https://ruoyuchen10.github.io/EAGLE/.
♻ ☆ CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks recently. However, their heavy computation costs remain daunting for resource-limited devices. To address this, researchers have dedicated themselves to compressing redundant information in ViTs for acceleration. However, existing approaches generally sparsely drop redundant image tokens by token pruning or brutally remove channels by channel pruning, leading to a sub-optimal balance between model performance and inference speed. Moreover, they struggle when transferring compressed models to downstream vision tasks that require the spatial structure of images, such as semantic segmentation. To tackle these issues, we propose CAIT, a joint \underline{c}ompression method for ViTs that achieves a harmonious blend of high \underline{a}ccuracy, fast \underline{i}nference speed, and favorable \underline{t}ransferability to downstream tasks. Specifically, we introduce an asymmetric token merging (ATME) strategy to effectively integrate neighboring tokens. It can successfully compress redundant token information while preserving the spatial structure of images. On top of it, we further design a consistent dynamic channel pruning (CDCP) strategy to dynamically prune unimportant channels in ViTs. Thanks to CDCP, insignificant channels in multi-head self-attention modules of ViTs can be pruned uniformly, significantly enhancing the model compression. Extensive experiments on multiple benchmark datasets show that our proposed method can achieve state-of-the-art performance across various ViTs.
comment: TPAMI 2025 Camera-ready Version
♻ ☆ Learning Unified Representations from Heterogeneous Data for Robust Heart Rate Modeling
Heart rate prediction is vital for personalized health monitoring and fitness, while it frequently faces a critical challenge when deploying in real-world: data heterogeneity. We classify it in two key dimensions: source heterogeneity from fragmented device markets with varying feature sets, and user heterogeneity reflecting distinct physiological patterns across individuals and activities. Existing methods either discard device-specific information, or fail to model user-specific differences, limiting their real-world performance. To address this, we propose a framework that learns latent representations agnostic to both heterogeneity, enabling downstream predictors to work consistently under heterogeneous data patterns. Specifically, we introduce a random feature dropout strategy to handle source heterogeneity, making the model robust to various feature sets. To manage user heterogeneity, we employ a time-aware attention module to capture long-term physiological traits and use a contrastive learning objective to build a discriminative representation space. To reflect the heterogeneous nature of real-world data, we created and publicly released a new benchmark dataset, ParroTao. Evaluations on both ParroTao and the public FitRec dataset show that our model significantly outperforms existing baselines by 17% and 15%, respectively. Furthermore, analysis of the learned representations demonstrates their strong discriminative power, and one downstream application task confirm the practical value of our model.
♻ ☆ PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation NeurIPS 2025
Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders the efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during the next token prediction. This results in the significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating the generation. Extensive experiments demonstrate that our method achieves the state-of-the-art performance compared with others. It exhibits superior inference efficiency and generation quality trade-offs, showing promising potential for practical applications. Code is available at https://github.com/THU-MIG/PrefixKV.
comment: NeurIPS 2025 Camera-ready Version
♻ ☆ YOLOE: Real-Time Seeing Anything ICCV 2025
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
comment: ICCV 2025 Camera-ready Version
♻ ☆ SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.
comment: For video demonstrations and additional materials please see https://nlml.github.io/sheap/
♻ ☆ CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image ICCV 2025
Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (upto 3 db PSNR) and geometric reconstruction under challenging conditions.
comment: Accepted at ICCV 2025
♻ ☆ UNet with Self-Adaptive Mamba-Like Attention and Causal-Resonance Learning for Medical Image Segmentation
Medical image segmentation plays an important role in various clinical applications; however, existing deep learning models face trade-offs between efficiency and accuracy. Convolutional Neural Networks (CNNs) capture local details well but miss the global context, whereas transformers handle the global context but at a high computational cost. Recently, State Space Sequence Models (SSMs) have shown potential for capturing long-range dependencies with linear complexity; however, their direct use in medical image segmentation remains limited due to incompatibility with image structures and autoregressive assumptions. To overcome these challenges, we propose SAMA-UNet, a novel U-shaped architecture that introduces two key innovations. First, the Self-Adaptive Mamba-like Aggregated Attention (SAMA) block adaptively integrates local and global features through dynamic attention weighting, enabling an efficient representation of complex anatomical patterns. Second, the causal resonance multi-scale module (CR-MSM) improves encoder-decoder interactions by adjusting feature resolution and causal dependencies across scales, enhancing the semantic alignment between low- and high-level features. Extensive experiments on MRI, CT, and endoscopy datasets demonstrate that SAMA-UNet consistently outperforms CNN, Transformer, and Mamba-based methods. It achieves 85.38% DSC and 87.82% NSD on BTCV, 92.16% and 96.54% on ACDC, 67.14% and 68.70% on EndoVis17, and 84.06% and 88.47% on ATLAS23, establishing new benchmarks across modalities. These results confirm the effectiveness of SAMA-UNet in combining efficiency and accuracy, making it a promising solution for real-world clinical segmentation tasks. The source code is available on GitHub.
♻ ☆ X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, We propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Code is publicly available at: https://x2-gaussian.github.io/.
comment: Project Page: https://x2-gaussian.github.io/
♻ ☆ Towards Reliable and Holistic Visual In-Context Learning Prompt Selection NeurIPS 2025
Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a global ranking problem of potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such a reliance on random sampling can lead to incomplete coverage and redundant samplings of comparisons, thus further adversely impacting the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.
comment: Accepted by NeurIPS 2025
♻ ☆ VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving near 2$\times$ speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
comment: 32 pages, 5 figures, 13 tables
♻ ☆ RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement
Capturing screens is now routine in our everyday lives. But the photographs of emissive displays are often influenced by the flicker-banding (FB), which is alternating bright%u2013dark stripes that arise from temporal aliasing between a camera's rolling-shutter readout and the display's brightness modulation. Unlike moire degradation, which has been extensively studied, the FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce Removal of Image Flicker-Banding via Latent Diffusion Enhancement, RIFLE, a diffusion-based framework designed to remove FB while preserving fine details. We propose the flicker-banding prior estimator (FPE) that predicts key banding attributes and injects it into the restoration network. Additionally, Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, banding spacing, and banding width. Feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on our real-world dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, it is the first work to research the simulation and removal of FB. Our work establishes a great foundation for subsequent research in both the dataset construction and the removal model design. Our dataset and code will be released soon.
♻ ☆ BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring ACM MM 2025
3D Gaussian Splatting has exhibited remarkable capabilities in 3D scene reconstruction. However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge.The performance of existing 3DGS-based deblurring methods are limited due to their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and inability to effectively control erroneous Gaussian primitives densification caused by motion blur. To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting, to accurately reconstruct 3D scenes from motion-blurred images. BSGS contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with fixed rough camera poses, Global RigidTransformation further corrects motion-induced blur distortions. To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages. Furthermore, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the arts.Our source code is available at https://github.com/wsxujm/bsgs
comment: Accept by ACM MM 2025
♻ ☆ Normal-Abnormal Guided Generalist Anomaly Detection NeurIPS 2025
Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work represents the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at https://github.com/JasonKyng/NAGL.
comment: Accepted by NeurIPS 2025
♻ ☆ VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement), which demonstrate that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
comment: Homepage: https://ltbai.github.io/VITA-VLA/
Scope: Selective Cross-modal Orchestration of Visual Perception Experts
Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49\%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
comment: 14 pages, 2 figures
♻ ☆ FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video$\rightarrow$EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at \href{https://github.com/HaoYin116/FLEX}{https://github.com/HaoYin116/FLEX}. Link to Project \href{https://haoyin116.github.io/FLEX_Dataset}{page}.
comment: Dataset and code are available at https://github.com/HaoYin116/FLEX . Link to Project page https://haoyin116.github.io/FLEX_Dataset
♻ ☆ V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception NeurIPS 2025
Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar, a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets. We will release all datasets and benchmark codebase at http://openmpd.com/column/V2X-Radar and https://github.com/yanglei18/V2X-Radar.
comment: NeurIPS 2025 Spotlight
♻ ☆ A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry
Accurate and reliable estimation of biases of low-cost Inertial Measurement Units (IMU) is a key factor to maintain the resilience of Visual-Inertial Odometry (VIO), particularly when visual tracking fails in challenging areas. In such cases, bias estimates from the VIO can deviate significantly from the real values because of the insufficient or erroneous vision features, compromising both localization accuracy and system stability. To address this challenge, we propose a novel plug-and-play module featuring the Inertial Prior Network (IPNet), which infers an IMU bias prior by implicitly capturing the motion characteristics of specific platforms. The core idea is inspired intuitively by the observation that different platforms exhibit distinctive motion patterns, while the integration of low-cost IMU measurements suffers from unbounded error that quickly accumulates over time. Therefore, these specific motion patterns can be exploited to infer the underlying IMU bias. In this work, we first directly infer the biases prior only using the raw IMU data using a sliding window approach, eliminating the dependency on recursive bias estimation combining visual features, thus effectively preventing error propagation in challenging areas. Moreover, to compensate for the lack of ground-truth bias in most visual-inertial datasets, we further introduce an iterative method to compute the mean per-sequence IMU bias for network training and release it to benefit society. The framework is trained and evaluated separately on two public datasets and a self-collected dataset. Extensive experiments show that our method significantly improves localization precision and robustness.
♻ ☆ General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting
Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed information flows, limiting their generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge for reasoning and planning, but prior LVLM-robot integrations have largely depended on pre-mapped spaces, hard-coded representations, and rigid control logic. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools drawn from modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query modules, reason over multimodal inputs, and select navigation actions. This agentic formulation enables robust navigation and reasoning in previously unmapped environments, offering a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA outperforms state-of-the-art EQA-specific approaches. Qualitative results on RxR and custom tasks further demonstrate its ability to generalize across a broad range of navigation challenges.
♻ ☆ Self-supervised Multi-future Occupancy Forecasting for Autonomous Driving
Environment prediction frameworks are critical for the safe navigation of autonomous vehicles (AVs) in dynamic settings. LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye view for the scene representation, enabling self-supervised joint scene predictions while exhibiting resilience to partial observability and perception detection failures. Prior approaches have focused on deterministic L-OGM prediction architectures within the grid cell space. While these methods have seen some success, they frequently produce unrealistic predictions and fail to capture the stochastic nature of the environment. Additionally, they do not effectively integrate additional sensor modalities present in AVs. Our proposed framework, Latent Occupancy Prediction (LOPR), performs stochastic L-OGM prediction in the latent space of a generative architecture and allows for conditioning on RGB cameras, maps, and planned trajectories. We decode predictions using either a single-step decoder, which provides high-quality predictions in real-time, or a diffusion-based batch decoder, which can further refine the decoded frames to address temporal consistency issues and reduce compression losses. Our experiments on the nuScenes and Waymo Open datasets show that all variants of our approach qualitatively and quantitatively outperform prior approaches.
comment: Accepted to Robotics: Science and Systems (RSS) 2025
♻ ☆ SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
comment: 20 pages, 11 figures
♻ ☆ L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery NeurIPS 2025
We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct XA-L\&RSI dataset, which encompasses approximately $110,000$ remote sensing submaps and $13,000$ LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on XA-L\&RSI demonstrate that, within a $100km^2$ retrieval range, L2RSI accurately localizes $83.27\%$ of point cloud submaps within a $30m$ radius for top-$1$ retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.
comment: 17 pages, 7 figures, NeurIPS 2025
♻ ☆ CMaP-SAM: Contraction Mapping Prior for SAM-driven Few-shot Segmentation
Few-shot segmentation (FSS) aims to segment new classes using few annotated images. While recent FSS methods have shown considerable improvements by leveraging Segment Anything Model (SAM), they face two critical limitations: insufficient utilization of structural correlations in query images, and significant information loss when converting continuous position priors to discrete point prompts. To address these challenges, we propose CMaP-SAM, a novel framework that introduces contraction mapping theory to optimize position priors for SAM-driven few-shot segmentation. CMaP-SAM consists of three key components: (1) a contraction mapping module that formulates position prior optimization as a Banach contraction mapping with convergence guarantees. This module iteratively refines position priors through pixel-wise structural similarity, generating a converged prior that preserves both semantic guidance from reference images and structural correlations in query images; (2) an adaptive distribution alignment module bridging continuous priors with SAM's binary mask prompt encoder; and (3) a foreground-background decoupled refinement architecture producing accurate final segmentation masks. Extensive experiments demonstrate CMaP-SAM's effectiveness, achieving state-of-the-art performance with 71.1 mIoU on PASCAL-$5^i$ and 56.1 on COCO-$20^i$ datasets. Code is available at https://github.com/Chenfan0206/CMaP-SAM.
comment: 7 figures
♻ ☆ WebInject: Prompt Injection Attack to Web Agents
Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. In this work, we propose WebInject, a prompt injection attack that manipulates the webpage environment to induce a web agent to perform an attacker-specified action. Our attack adds a perturbation to the raw pixel values of the rendered webpage. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the attacker-specified action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple datasets shows that WebInject is highly effective and significantly outperforms baselines.
comment: Appeared in EMNLP 2025 main conference. To better understand prompt injection attacks, see https://people.duke.edu/~zg70/code/PromptInjection.pdf
♻ ☆ iLRM: An Iterative Large 3D Reconstruction Model
Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.
comment: Project page: https://gynjn.github.io/iLRM/
♻ ☆ D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
comment: Corrected typos
♻ ☆ Hyperbolic Dataset Distillation NeurIPS 2025
To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. To the best of our knowledge, this is the first work to incorporate the hyperbolic space into the dataset distillation process. The code is available at https://github.com/Guang000/HDD.
comment: Accepted to NeurIPS 2025
♻ ☆ MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g. 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate on this setting, we extend multi-object navigation task setting to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified using language. Furthermore, we propose Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
♻ ☆ AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
With the raid evolution of large language models and multimodal models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that should be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, APPs, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose mobile agent that operates across applications. AppCopilot operationalizes this position through an end-to-end pipeline spanning data collection, training, finetuning, efficient inference, and PC/mobile application. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables experiential adaptation, voice interaction, function calling, cross-APP and cross-device orchestration, and comprehensive mobile APP support. The system design incorporates profiling-driven optimization for latency and memory across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements on four dimensions: stronger generalization, higher precision of on screen actions, more reliable long horizon task completion, and faster, more resource efficient runtime. By articulating a cohesive position and a reference architecture that closes the loop from data collection, training to finetuning and efficient inference, this paper offers a concrete roadmap for general purpose mobile agent and provides actionable guidance.
comment: Project at https://github.com/OpenBMB/AppCopilot
♻ ☆ Exploring and Leveraging Class Vectors for Classifier Editing NeurIPS 2025
Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce Class Vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, Class Vectors disentangle each class's adaptation in the latent space. We show that Class Vectors capture each class's semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of Class Vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.
comment: Accepted in NeurIPS 2025
♻ ☆ Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's (of an MLLM) hidden representations. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to embedding optimization, underscoring the effectiveness of our probing setup. We demonstrate that our VisPer-LM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
comment: Project Page: https://praeclarumjj3.github.io/visper_lm/
Machine Learning 209
☆ SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling
Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improve performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.
comment: 13 pages, EMNLP 2025
☆ Self-Certifying Primal-Dual Optimization Proxies for Large-Scale Batch Economic Dispatch
Recent research has shown that optimization proxies can be trained to high fidelity, achieving average optimality gaps under 1% for large-scale problems. However, worst-case analyses show that there exist in-distribution queries that result in orders of magnitude higher optimality gap, making it difficult to trust the predictions in practice. This paper aims at striking a balance between classical solvers and optimization proxies in order to enable trustworthy deployments with interpretable speed-optimality tradeoffs based on a user-defined optimality threshold. To this end, the paper proposes a hybrid solver that leverages duality theory to efficiently bound the optimality gap of predictions, falling back to a classical solver for queries where optimality cannot be certified. To improve the achieved speedup of the hybrid solver, the paper proposes an alternative training procedure that combines the primal and dual proxy training. Experiments on large-scale transmission systems show that the hybrid solver is highly scalable. The proposed hybrid solver achieves speedups of over 1000x compared to a parallelized simplex-based solver while guaranteeing a maximum optimality gap of 2%.
☆ Learning Correlated Reward Models: Statistical Barriers and Opportunities
Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses \emph{all} human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a \emph{correlated} probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is \emph{fundamentally insufficient} to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that \emph{best-of-three} preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.
☆ Transfer Orthology Networks
We present Transfer Orthology Networks (TRON), a novel neural network architecture designed for cross-species transfer learning. TRON leverages orthologous relationships, represented as a bipartite graph between species, to guide knowledge transfer. Specifically, we prepend a learned species conversion layer, whose weights are masked by the biadjacency matrix of this bipartite graph, to a pre-trained feedforward neural network that predicts a phenotype from gene expression data in a source species. This allows for efficient transfer of knowledge to a target species by learning a linear transformation that maps gene expression from the source to the target species' gene space. The learned weights of this conversion layer offer a potential avenue for interpreting functional orthology, providing insights into how genes across species contribute to the phenotype of interest. TRON offers a biologically grounded and interpretable approach to cross-species transfer learning, paving the way for more effective utilization of available transcriptomic data. We are in the process of collecting cross-species transcriptomic/phenotypic data to gain experimental validation of the TRON architecture.
comment: 4 pages
☆ FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement
Quantum computing has the potential to revolutionize fields like quantum optimization and quantum machine learning. However, current quantum devices are hindered by noise, reducing their reliability. A key challenge in gate-based quantum computing is improving the reliability of quantum circuits, measured by process fidelity, during the transpilation process, particularly in the routing stage. In this paper, we address the Fidelity Maximization in Routing Stage (FMRS) problem by introducing FIDDLE, a novel learning framework comprising two modules: a Gaussian Process-based surrogate model to estimate process fidelity with limited training samples and a reinforcement learning module to optimize routing. Our approach is the first to directly maximize process fidelity, outperforming traditional methods that rely on indirect metrics such as circuit depth or gate count. We rigorously evaluate FIDDLE by comparing it with state-of-the-art fidelity estimation techniques and routing optimization methods. The results demonstrate that our proposed surrogate model is able to provide a better estimation on the process fidelity compared to existing learning techniques, and our end-to-end framework significantly improves the process fidelity of quantum circuits across various noise models.
☆ SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients
The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory - the pseudo-gradient - that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, claiming to even outperform AdamW in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5 - 2.5$\times$ in a non-distributed setting up to a scale of 1e23 training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
☆ Blackwell's Approachability for Sequential Conformal Inference
We study conformal inference in non-exchangeable environments through the lens of Blackwell's theory of approachability. We first recast adaptive conformal inference (ACI, Gibbs and Cand\`es, 2021) as a repeated two-player vector-valued finite game and characterize attainable coverage--efficiency tradeoffs. We then construct coverage and efficiency objectives under potential restrictions on the adversary's play, and design a calibration-based approachability strategy to achieve these goals. The resulting algorithm enjoys strong theoretical guarantees and provides practical insights, though its computational burden may limit deployment in practice.
comment: 25 pages, 0 figures
☆ Chronos-2: From Univariate to Universal Forecasting
Pretrained time series models have enabled inference-only forecasting systems that produce accurate predictions without task-specific training. However, existing approaches largely focus on univariate forecasting, limiting their applicability in real-world scenarios where multivariate data and covariates play a crucial role. We present Chronos-2, a pretrained model capable of handling univariate, multivariate, and covariate-informed forecasting tasks in a zero-shot manner. Chronos-2 employs a group attention mechanism that facilitates in-context learning (ICL) through efficient information sharing across multiple time series within a group, which may represent sets of related series, variates of a multivariate series, or targets and covariates in a forecasting task. These general capabilities are achieved through training on synthetic datasets that impose diverse multivariate structures on univariate series. Chronos-2 delivers state-of-the-art performance across three comprehensive benchmarks: fev-bench, GIFT-Eval, and Chronos Benchmark II. On fev-bench, which emphasizes multivariate and covariate-informed forecasting, Chronos-2's universal ICL capabilities lead to substantial improvements over existing models. On tasks involving covariates, it consistently outperforms baselines by a wide margin. Case studies in the energy and retail domains further highlight its practical advantages. The in-context learning capabilities of Chronos-2 establish it as a general-purpose forecasting model that can be used "as is" in real-world forecasting pipelines.
☆ Error analysis of a compositional score-based algorithm for simulation-based inference
Simulation-based inference (SBI) has become a widely used framework in applied sciences for estimating the parameters of stochastic models that best explain experimental observations. A central question in this setting is how to effectively combine multiple observations in order to improve parameter inference and obtain sharper posterior distributions. Recent advances in score-based diffusion methods address this problem by constructing a compositional score, obtained by aggregating individual posterior scores within the diffusion process. While it is natural to suspect that the accumulation of individual errors may significantly degrade sampling quality as the number of observations grows, this important theoretical issue has so far remained unexplored. In this paper, we study the compositional score produced by the GAUSS algorithm of Linhart et al. (2024) and establish an upper bound on its mean squared error in terms of both the individual score errors and the number of observations. We illustrate our theoretical findings on a Gaussian example, where all analytical expressions can be derived in a closed form.
☆ On Universality of Deep Equivariant Networks
Universality results for equivariant neural networks remain rare. Those that do exist typically hold only in restrictive settings: either they rely on regular or higher-order tensor representations, leading to impractically high-dimensional hidden spaces, or they target specialized architectures, often confined to the invariant setting. This work develops a more general account. For invariant networks, we establish a universality theorem under separation constraints, showing that the addition of a fully connected readout layer secures approximation within the class of separation-constrained continuous functions. For equivariant networks, where results are even scarcer, we demonstrate that standard separability notions are inadequate and introduce the sharper criterion of $\textit{entry-wise separability}$. We show that with sufficient depth or with the addition of appropriate readout layers, equivariant networks attain universality within the entry-wise separable regime. Together with prior results showing the failure of universality for shallow models, our findings identify depth and readout layers as a decisive mechanism for universality, additionally offering a unified perspective that subsumes and extends earlier specialized results.
comment: Preprint. 22 pages
☆ AB-UPT for Automotive and Aerospace Applications
The recently proposed Anchored-Branched Universal Physics Transformers (AB-UPT) shows strong capabilities to replicate automotive computational fluid dynamics simulations requiring orders of magnitudes less compute than traditional numerical solvers. In this technical report, we add two new datasets to the body of empirically evaluated use-cases of AB-UPT, combining high-quality data generation with state-of-the-art neural surrogates. Both datasets were generated with the Luminary Cloud platform containing automotives (SHIFT-SUV) and aircrafts (SHIFT-Wing). We start by detailing the data generation. Next, we show favorable performances of AB-UPT against previous state-of-the-art transformer-based baselines on both datasets, followed by extensive qualitative and quantitative evaluations of our best AB-UPT model. AB-UPT shows strong performances across the board. Notably, it obtains near perfect prediction of integrated aerodynamic forces within seconds from a simple isotopically tesselate geometry representation and is trainable within a day on a single GPU, paving the way for industry-scale applications.
☆ Cavity Duplexer Tuning with 1d Resnet-like Neural Networks
This paper presents machine learning method for tuning of cavity duplexer with a large amount of adjustment screws. After testing we declined conventional reinforcement learning approach and reformulated our task in the supervised learning setup. The suggested neural network architecture includes 1d ResNet-like backbone and processing of some additional information about S-parameters, like the shape of curve and peaks positions and amplitudes. This neural network with external control algorithm is capable to reach almost the tuned state of the duplexer within 4-5 rotations per screw.
☆ DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation
We present DexCanvas, a large-scale hybrid real-synthetic human manipulation dataset containing 7,000 hours of dexterous hand-object interactions seeded from 70 hours of real human demonstrations, organized across 21 fundamental manipulation types based on the Cutkosky taxonomy. Each entry combines synchronized multi-view RGB-D, high-precision mocap with MANO hand parameters, and per-frame contact points with physically consistent force profiles. Our real-to-sim pipeline uses reinforcement learning to train policies that control an actuated MANO hand in physics simulation, reproducing human demonstrations while discovering the underlying contact forces that generate the observed object motion. DexCanvas is the first manipulation dataset to combine large-scale real demonstrations, systematic skill coverage based on established taxonomies, and physics-validated contact annotations. The dataset can facilitate research in robotic manipulation learning, contact-rich control, and skill transfer across different hand morphologies.
☆ Enhanced Renewable Energy Forecasting using Context-Aware Conformal Prediction
Accurate forecasting is critical for reliable power grid operations, particularly as the share of renewable generation, such as wind and solar, continues to grow. Given the inherent uncertainty and variability in renewable generation, probabilistic forecasts have become essential for informed operational decisions. However, such forecasts frequently suffer from calibration issues, potentially degrading decision-making performance. Building on recent advances in Conformal Predictions, this paper introduces a tailored calibration framework that constructs context-aware calibration sets using a novel weighting scheme. The proposed framework improves the quality of probabilistic forecasts at the site and fleet levels, as demonstrated by numerical experiments on large-scale datasets covering several systems in the United States. The results demonstrate that the proposed approach achieves higher forecast reliability and robustness for renewable energy applications compared to existing baselines.
☆ Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model
Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restricts their practical value, directly damage the responsibility of strategy from concept-based methods. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, Experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.
☆ On Non-interactive Evaluation of Animal Communication Translators
If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying ``hallucinations,'' false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. It is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may not be necessary nor efficient in the early stages of learning to translate.
☆ QSilk: Micrograin Stabilization and Adaptive Quantile Clipping for Detail-Friendly Latent Diffusion
We present QSilk, a lightweight, always-on stabilization layer for latent diffusion that improves high-frequency fidelity while suppressing rare activation spikes. QSilk combines (i) a per-sample micro clamp that gently limits extreme values without washing out texture, and (ii) Adaptive Quantile Clip (AQClip), which adapts the allowed value corridor per region. AQClip can operate in a proxy mode using local structure statistics or in an attention entropy guided mode (model confidence). Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It requires no training or fine-tuning and exposes minimal user controls. We report consistent qualitative improvements across SD/SDXL backbones and show synergy with CFG/Rescale, enabling slightly higher guidance without artifacts.
comment: Preprint. Qualitative side-by-side comparisons (fixed seeds); 3 figures with subfigures; 1 algorithm. CADE 2.5 / SDXL integration; sample images included. Code and presets planned for release upon publication
☆ Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity
Poultry farming faces increasing pressure to meet productivity targets while ensuring animal welfare and environmental compliance. Yet many small and medium-sized farms lack affordable, integrated tools for continuous monitoring and decision-making, relying instead on manual, reactive inspections. This paper presents Poultry Farm Intelligence (PoultryFI) - a modular, cost-effective platform that integrates six AI-powered modules: Camera Placement Optimizer, Audio-Visual Monitoring, Analytics & Alerting, Real-Time Egg Counting, Production & Profitability Forecasting, and a Recommendation Module. Camera layouts are first optimized offline using evolutionary algorithms for full poultry house coverage with minimal hardware. The Audio-Visual Monitoring module extracts welfare indicators from synchronized video, audio, and feeding data. Analytics & Alerting produces daily summaries and real-time notifications, while Real-Time Egg Counting uses an edge vision model to automate production tracking. Forecasting models predict egg yield and feed consumption up to 10 days in advance, and the Recommendation Module integrates forecasts with weather data to guide environmental and operational adjustments. This is among the first systems to combine low-cost sensing, edge analytics, and prescriptive AI to continuously monitor flocks, predict production, and optimize performance. Field trials demonstrate 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting. PoultryFI bridges the gap between isolated pilot tools and scalable, farm-wide intelligence, empowering producers to proactively safeguard welfare and profitability.
☆ Semantic segmentation with coarse annotations
Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves best results using annotated images, where each pixel is annotated with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an images leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel based upsampling. It encourages the segmented pixels in the decoded image to be SLIC-superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke data sets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.
☆ SAMix: Calibrated and Accurate Continual Learning via Sphere-Adaptive Mixup and Neural Collapse
While most continual learning methods focus on mitigating forgetting and improving accuracy, they often overlook the critical aspect of network calibration, despite its importance. Neural collapse, a phenomenon where last-layer features collapse to their class means, has demonstrated advantages in continual learning by reducing feature-classifier misalignment. Few works aim to improve the calibration of continual models for more reliable predictions. Our work goes a step further by proposing a novel method that not only enhances calibration but also improves performance by reducing overconfidence, mitigating forgetting, and increasing accuracy. We introduce Sphere-Adaptive Mixup (SAMix), an adaptive mixup strategy tailored for neural collapse-based methods. SAMix adapts the mixing process to the geometric properties of feature spaces under neural collapse, ensuring more robust regularization and alignment. Experiments show that SAMix significantly boosts performance, surpassing SOTA methods in continual learning while also improving model calibration. SAMix enhances both across-task accuracy and the broader reliability of predictions, making it a promising advancement for robust continual learning systems.
☆ RLAF: Reinforcement Learning from Automaton Feedback
Reinforcement Learning (RL) in environments with complex, history-dependent reward structures poses significant challenges for traditional methods. In this work, we introduce a novel approach that leverages automaton-based feedback to guide the learning process, replacing explicit reward functions with preferences derived from a deterministic finite automaton (DFA). Unlike conventional approaches that use automata for direct reward specification, our method employs the structure of the DFA to generate preferences over trajectories that are used to learn a reward function, eliminating the need for manual reward engineering. Our framework introduces a static approach that uses the learned reward function directly for policy optimization and a dynamic approach that involves continuous refining of the reward function and policy through iterative updates until convergence. Our experiments in both discrete and continuous environments demonstrate that our approach enables the RL agent to learn effective policies for tasks with temporal dependencies, outperforming traditional reward engineering and automaton-based baselines such as reward machines and LTL-guided methods. Our results highlight the advantages of automaton-based preferences in handling non-Markovian rewards, offering a scalable, efficient, and human-independent alternative to traditional reward modeling. We also provide a convergence guarantee showing that under standard assumptions our automaton-guided preference-based framework learns a policy that is near-optimal with respect to the true non-Markovian objective.
☆ ProSh: Probabilistic Shielding for Model-free Reinforcement Learning
Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent's policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
☆ A Split-Client Approach to Second-Order Optimization
Second-order methods promise faster convergence but are rarely used in practice because Hessian computations and decompositions are far more expensive than gradients. We propose a \emph{split-client} framework where gradients and curvature are computed asynchronously by separate clients. This abstraction captures realistic delays and inexact Hessian updates while avoiding the manual tuning required by Lazy Hessian methods. Focusing on cubic regularization, we show that our approach retains strong convergence guarantees and achieves a provable wall-clock speedup of order $\sqrt{\tau}$, where $\tau$ is the relative time needed to compute and decompose the Hessian compared to a gradient step. Since $\tau$ can be orders of magnitude larger than one in high-dimensional problems, this improvement is practically significant. Experiments on synthetic and real datasets confirm the theory: asynchronous curvature consistently outperforms vanilla and Lazy Hessian baselines, while maintaining second-order accuracy.
☆ ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods -- mainly agentic scaffolding with off-the-shelf LLMs -- struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 49% on Seed-Prover's IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.
comment: 52 pages, 16 figures, website: http://proof-optimizer.github.io/
☆ Constrained Adversarial Perturbation
Deep neural networks have achieved remarkable success in a wide range of classification tasks. However, they remain highly susceptible to adversarial examples - inputs that are subtly perturbed to induce misclassification while appearing unchanged to humans. Among various attack strategies, Universal Adversarial Perturbations (UAPs) have emerged as a powerful tool for both stress testing model robustness and facilitating scalable adversarial training. Despite their effectiveness, most existing UAP methods neglect domain specific constraints that govern feature relationships. Violating such constraints, such as debt to income ratios in credit scoring or packet flow invariants in network communication, can render adversarial examples implausible or easily detectable, thereby limiting their real world applicability. In this work, we advance universal adversarial attacks to constrained feature spaces by formulating an augmented Lagrangian based min max optimization problem that enforces multiple, potentially complex constraints of varying importance. We propose Constrained Adversarial Perturbation (CAP), an efficient algorithm that solves this problem using a gradient based alternating optimization strategy. We evaluate CAP across diverse domains including finance, IT networks, and cyber physical systems, and demonstrate that it achieves higher attack success rates while significantly reducing runtime compared to existing baselines. Our approach also generalizes seamlessly to individual adversarial perturbations, where we observe similar strong performance gains. Finally, we introduce a principled procedure for learning feature constraints directly from data, enabling broad applicability across domains with structured input spaces.
☆ Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction.
☆ KS-Net: Multi-layer network model for determining the rotor type from motor parameters in interior PMSMs
The demand for high efficiency and precise control in electric drive systems has led to the widespread adoption of Interior Permanent Magnet Synchronous Motors (IPMSMs). The performance of these motors is significantly influenced by rotor geometry. Traditionally, rotor shape analysis has been conducted using the finite element method (FEM), which involves high computational costs. This study aims to classify the rotor shape (2D type, V type, Nabla type) of IPMSMs using electromagnetic parameters through machine learning-based methods and to demonstrate the applicability of this approach as an alternative to classical methods. In this context, a custom deep learning model, KS-Net, developed by the user, was comparatively evaluated against Cubic SVM, Quadratic SVM, Fine KNN, Cosine KNN, and Fine Tree algorithms. The balanced dataset, consisting of 9,000 samples, was tested using 10-fold cross-validation, and performance metrics such as accuracy, precision, recall, and F1-score were employed. The results indicate that the Cubic SVM and Quadratic SVM algorithms classified all samples flawlessly, achieving 100% accuracy, while the KS-Net model achieved 99.98% accuracy with only two misclassifications, demonstrating competitiveness with classical methods. This study shows that the rotor shape of IPMSMs can be predicted with high accuracy using data-driven approaches, offering a fast and cost-effective alternative to FEM-based analyses. The findings provide a solid foundation for accelerating motor design processes, developing automated rotor identification systems, and enabling data-driven fault diagnosis in engineering applications.
comment: This study was presented at the 3rd International Conference on Advances and Innovations in Engineering (ICAIE) and published in the conference proceedings
☆ CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning
Allocating more computation during inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-$N$ sampling often show diminishing returns as $N$ increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose CarBoN (Calibrated Best-of-$N$), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature $T$ and additive shift vector $\delta$, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of $T$ and $\delta$ in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration.
☆ Disentanglement of Sources in a Multi-Stream Variational Autoencoder
Variational autoencoders (VAEs) are a leading approach to address the problem of learning disentangled representations. Typically a single VAE is used and disentangled representations are sought in its continuous latent space. Here we explore a different approach by using discrete latents to combine VAE-representations of individual sources. The combination is done based on an explicit model for source combination, and we here use a linear combination model which is well suited, e.g., for acoustic data. We formally define such a multi-stream VAE (MS-VAE) approach, derive its inference and learning equations, and we numerically investigate its principled functionality. The MS-VAE is domain-agnostic, and we here explore its ability to separate sources into different streams using superimposed hand-written digits, and mixed acoustic sources in a speaker diarization task. We observe a clear separation of digits, and on speaker diarization we observe an especially low rate of missed speakers. Numerical experiments further highlight the flexibility of the approach across varying amounts of supervision and training data.
☆ Bayesian Inference for PDE-based Inverse Problems using the Optimization of a Discrete Loss
Inverse problems are crucial for many applications in science, engineering and medicine that involve data assimilation, design, and imaging. Their solution infers the parameters or latent states of a complex system from noisy data and partially observable processes. When measurements are an incomplete or indirect view of the system, additional knowledge is required to accurately solve the inverse problem. Adopting a physical model of the system in the form of partial differential equations (PDEs) is a potent method to close this gap. In particular, the method of optimizing a discrete loss (ODIL) has shown great potential in terms of robustness and computational cost. In this work, we introduce B-ODIL, a Bayesian extension of ODIL, that integrates the PDE loss of ODIL as prior knowledge and combines it with a likelihood describing the data. B-ODIL employs a Bayesian formulation of PDE-based inverse problems to infer solutions with quantified uncertainties. We demonstrate the capabilities of B-ODIL in a series of synthetic benchmarks involving PDEs in one, two, and three dimensions. We showcase the application of B-ODIL in estimating tumor concentration and its uncertainty in a patient's brain from MRI scans using a three-dimensional tumor growth model.
☆ WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables
Fast and efficient machine learning is of growing interest to the scientific community and has spurred significant research into novel model architectures and hardware-aware design. Recent hard? and software co-design approaches have demonstrated impressive results with entirely multiplication-free models. Differentiable Logic Gate Networks (DLGNs), for instance, provide a gradient-based framework for learning optimal combinations of low-level logic gates, setting state-of-the-art trade-offs between accuracy, resource usage, and latency. However, these models suffer from high computational cost during training and do not generalize well to logic blocks with more inputs. In this work, we introduce Walsh-Assisted Relaxation for Probabilistic Look-Up Tables (WARP-LUTs) - a novel gradient-based method that efficiently learns combinations of logic gates with substantially fewer trainable parameters. We demonstrate that WARP-LUTs achieve significantly faster convergence on CIFAR-10 compared to DLGNs, while maintaining comparable accuracy. Furthermore, our approach suggests potential for extension to higher-input logic blocks, motivating future research on extremely efficient deployment on modern FPGAs and its real-time science applications.
comment: Preprint. Under review
Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization
The Tsetlin Machine (TM) offers high-speed inference on resource-constrained devices such as CPUs. Its logic-driven operations naturally lend themselves to parallel execution on modern CPU architectures. Motivated by this, we propose an efficient software implementation of the TM by leveraging instruction-level bitwise operations for compact model representation and accelerated processing. To further improve inference speed, we introduce an early exit mechanism, which exploits the TM's AND-based clause evaluation to avoid unnecessary computations. Building upon this, we propose a literal Reorder strategy designed to maximize the likelihood of early exits. This strategy is applied during a post-training, pre-inference stage through statistical analysis of all literals and the corresponding actions of their associated Tsetlin Automata (TA), introducing negligible runtime overhead. Experimental results using the gem5 simulator with an ARM processor show that our optimized implementation reduces inference time by up to 96.71% compared to the conventional integer-based TM implementations while maintaining comparable code density.
☆ GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters
The growing demand for computational resources in machine learning has made efficient resource allocation a critical challenge, especially in heterogeneous hardware clusters where devices vary in capability, age, and energy efficiency. Upgrading to the latest hardware is often infeasible, making sustainable use of existing, mixed-generation resources essential. In this paper, we propose a learning-based architecture for managing machine learning workloads in heterogeneous clusters. The system operates online, allocating resources to incoming training or inference requests while minimizing energy consumption and meeting performance requirements. It uses two neural networks: the first provides initial estimates of how well a new model will utilize different hardware types and how it will affect co-located models. An optimizer then allocates resources based on these estimates. After deployment, the system monitors real performance and uses this data to refine its predictions via a second neural network. This updated model improves estimates not only for the current hardware but also for hardware not initially allocated and for co-location scenarios not yet observed. The result is an adaptive, iterative approach that learns over time to make more effective resource allocation decisions in heterogeneous deep learning clusters.
comment: 10 pages, 5 figures
☆ Deep Neural ODE Operator Networks for PDEs
Operator learning has emerged as a promising paradigm for developing efficient surrogate models to solve partial differential equations (PDEs). However, existing approaches often overlook the domain knowledge inherent in the underlying PDEs and hence suffer from challenges in capturing temporal dynamics and generalization issues beyond training time frames. This paper introduces a deep neural ordinary differential equation (ODE) operator network framework, termed NODE-ONet, to alleviate these limitations. The framework adopts an encoder-decoder architecture comprising three core components: an encoder that spatially discretizes input functions, a neural ODE capturing latent temporal dynamics, and a decoder reconstructing solutions in physical spaces. Theoretically, error analysis for the encoder-decoder architecture is investigated. Computationally, we propose novel physics-encoded neural ODEs to incorporate PDE-specific physical properties. Such well-designed neural ODEs significantly reduce the framework's complexity while enhancing numerical efficiency, robustness, applicability, and generalization capacity. Numerical experiments on nonlinear diffusion-reaction and Navier-Stokes equations demonstrate high accuracy, computational efficiency, and prediction capabilities beyond training time frames. Additionally, the framework's flexibility to accommodate diverse encoders/decoders and its ability to generalize across related PDE families further underscore its potential as a scalable, physics-encoded tool for scientific machine learning.
☆ Decentralized Parameter-Free Online Learning
We propose the first parameter-free decentralized online learning algorithms with network regret guarantees, which achieve sublinear regret without requiring hyperparameter tuning. This family of algorithms connects multi-agent coin-betting and decentralized online learning via gossip steps. To enable our decentralized analysis, we introduce a novel "betting function" formulation for coin-betting that simplifies the multi-agent regret analysis. Our analysis shows sublinear network regret bounds and is validated through experiments on synthetic and real datasets. This family of algorithms is applicable to distributed sensing, decentralized optimization, and collaborative ML applications.
☆ Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation
The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present \texttt{freephdlabor}, an open-source multiagent framework featuring \textit{fully dynamic workflows} determined by real-time agent reasoning and a \coloremph{\textit{modular architecture}} enabling seamless customization -- users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including \textit{automatic context compaction}, \textit{workspace-based communication} to prevent information degradation, \textit{memory persistence} across sessions, and \textit{non-blocking human intervention} mechanisms. These features collectively transform automated research from isolated, single-run attempts into \textit{continual research programs} that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research -- from ideation through experimentation to publication-ready manuscripts.
comment: 37 pages, 5 figures. Code: https://github.com/ltjed/freephdlabor
☆ CQD-SHAP: Explainable Complex Query Answering via Shapley Values
Complex query answering (CQA) goes beyond the well-studied link prediction task by addressing more sophisticated queries that require multi-hop reasoning over incomplete knowledge graphs (KGs). Research on neural and neurosymbolic CQA methods is still an emerging field. Almost all of these methods can be regarded as black-box models, which may raise concerns about user trust. Although neurosymbolic approaches like CQD are slightly more interpretable, allowing intermediate results to be tracked, the importance of different parts of the query remains unexplained. In this paper, we propose CQD-SHAP, a novel framework that computes the contribution of each query part to the ranking of a specific answer. This contribution explains the value of leveraging a neural predictor that can infer new knowledge from an incomplete KG, rather than a symbolic approach relying solely on existing facts in the KG. CQD-SHAP is formulated based on Shapley values from cooperative game theory and satisfies all the fundamental Shapley axioms. Automated evaluation of these explanations in terms of necessary and sufficient explanations, and comparisons with various baselines, shows the effectiveness of this approach for most query types.
☆ GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device
Semantic top-K selection with cross-encoder rerankers underpins of on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, allowing pruning opportunities prior to completing full inference. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, GRATING. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via dual-layer sliding window and chunked execution. We evaluate GRATING against state-of-the-art baselines on rerankers from 0.6B to 8B parameters across Apple M2 and RTX 5070. GRATING consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, GRATING lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.
☆ Stochastic Optimization with Random Search
We revisit random search for stochastic optimization, where only noisy function evaluations are available. We show that the method works under weaker smoothness assumptions than previously considered, and that stronger assumptions enable improved guarantees. In the finite-sum setting, we design a variance-reduced variant that leverages multiple samples to accelerate convergence. Our analysis relies on a simple translation invariance property, which provides a principled way to balance noise and reduce variance.
Kernel-Based Evaluation of Conditional Biological Sequence Models
We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model's temperature hyperparameter to achieve a better fit.
comment: 29 pages
☆ Attn-JGNN: Attention Enhanced Join-Graph Neural Networks
We propose an Attention Enhanced Join-Graph Neural Networks(Attn-JGNN) model for solving #SAT problems, which significantly improves the solving accuracy. Inspired by the Iterative Join Graph Propagation (IJGP) algorithm, Attn-JGNN uses tree decomposition to encode the CNF formula into a join-graph, then performs iterative message passing on the join-graph, and finally approximates the model number by learning partition functions. In order to further improve the accuracy of the solution, we apply the attention mechanism in and between clusters of the join-graphs, which makes Attn-JGNN pay more attention to the key variables and clusters in probabilistic inference, and reduces the redundant calculation. Finally, our experiments show that our Attn-JGNN model achieves better results than other neural network methods.
☆ On the Neural Feature Ansatz for Deep Neural Networks
Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power $\alpha>0$ of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent $\alpha = 1/2$ (Radhakrishnan et al., 2024). We extend this result to networks with $L \geq 2$ layers, showing that the NFA holds with exponent $\alpha = 1/L$, thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit arbitrarily well the training data. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates and initialization schemes.
☆ Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems
We introduce the Strategic Doubly Robust (SDR) estimator, a novel framework that integrates strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments. SDR addresses endogenous treatment assignment arising from strategic agent behavior, maintaining double robustness while incorporating strategic considerations. Theoretical analysis confirms SDR's consistency and asymptotic normality under strategic unconfoundedness. Empirical evaluations demonstrate SDR's superior performance over baseline methods, achieving 7.6\%-29.3\% bias reduction across varying strategic strengths and maintaining robust scalability with agent populations. The framework provides a principled approach for reliable causal inference when agents respond strategically to interventions.
☆ Rethinking Cross-lingual Gaps from a Statistical Viewpoint
Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has rationalized divergence in latent representations in source and target languages as the source of cross-lingual gap. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence which support proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models.
comment: 22 pages
☆ Geometric Convergence Analysis of Variational Inference via Bregman Divergences
Variational Inference (VI) provides a scalable framework for Bayesian inference by optimizing the Evidence Lower Bound (ELBO), but convergence analysis remains challenging due to the objective's non-convexity and non-smoothness in Euclidean space. We establish a novel theoretical framework for analyzing VI convergence by exploiting the exponential family structure of distributions. We express negative ELBO as a Bregman divergence with respect to the log-partition function, enabling a geometric analysis of the optimization landscape. We show that this Bregman representation admits a weak monotonicity property that, while weaker than convexity, provides sufficient structure for rigorous convergence analysis. By deriving bounds on the objective function along rays in parameter space, we establish properties governed by the spectral characteristics of the Fisher information matrix. Under this geometric framework, we prove non-asymptotic convergence rates for gradient descent algorithms with both constant and diminishing step sizes.
comment: 14 pages, 4 figures
☆ Hypergraph Contrastive Sensor Fusion for Multimodal Fault Diagnosis in Induction Motors
Reliable induction motor (IM) fault diagnosis is vital for industrial safety and operational continuity, mitigating costly unplanned downtime. Conventional approaches often struggle to capture complex multimodal signal relationships, are constrained to unimodal data or single fault types, and exhibit performance degradation under noisy or cross-domain conditions. This paper proposes the Multimodal Hypergraph Contrastive Attention Network (MM-HCAN), a unified framework for robust fault diagnosis. To the best of our knowledge, MM-HCAN is the first to integrate contrastive learning within a hypergraph topology specifically designed for multimodal sensor fusion, enabling the joint modelling of intra- and inter-modal dependencies and enhancing generalisation beyond Euclidean embedding spaces. The model facilitates simultaneous diagnosis of bearing, stator, and rotor faults, addressing the engineering need for consolidated di- agnostic capabilities. Evaluated on three real-world benchmarks, MM-HCAN achieves up to 99.82% accuracy with strong cross-domain generalisation and resilience to noise, demonstrating its suitability for real-world deployment. An ablation study validates the contribution of each component. MM-HCAN provides a scalable and robust solution for comprehensive multi-fault diagnosis, supporting predictive maintenance and extended asset longevity in industrial environments.
comment: Submitted to IEEE Sensors Journal
☆ SpikeFit: Towards Optimal Deployment of Spiking Networks on Neuromorphic Hardware
This paper introduces SpikeFit, a novel training method for Spiking Neural Networks (SNNs) that enables efficient inference on neuromorphic hardware, considering all its stringent requirements: the number of neurons and synapses that can fit on a single device, and lower bit-width representations (e.g., 4-bit, 8-bit). Unlike conventional compressing approaches that address only a subset of these requirements (limited numerical precision and limited number of neurons in the network), SpikeFit treats the allowed weights' discrete values themselves as learnable parameters co-optimized with the model, allowing for optimal Clusterization-Aware Training (CAT) of the model's weights at low precision (2-, 4-, or 8-bit) which results in higher network compression efficiency, as well as limiting the number of unique synaptic connections to a value required by neuromorphic processor. This joint optimization allows SpikeFit to find a discrete weight set aligned with hardware constraints, enabling the most complete deployment across a broader range of neuromorphic processors than existing methods of SNN compression support. Moreover, SpikeFit introduces a new hardware-friendly Fisher Spike Contribution (FSC) pruning method showing the state-of-the-art performance. We demonstrate that for spiking neural networks constrained to only four unique synaptic weight values (M = 4), our SpikeFit method not only outperforms state-of-the-art SNNs compression methods and conventional baselines combining extreme quantization schemes and clustering algorithms, but also meets a wider range of neuromorphic hardware requirements and provides the lowest energy use in experiments.
comment: 13 pages, 2 figures. Work in progress
☆ An Empirical Study on MC Dropout--Based Uncertainty--Error Correlation in 2D Brain Tumor Segmentation
Accurate brain tumor segmentation from MRI is vital for diagnosis and treatment planning. Although Monte Carlo (MC) Dropout is widely used to estimate model uncertainty, its effectiveness in identifying segmentation errors -- especially near tumor boundaries -- remains unclear. This study empirically examines the relationship between MC Dropout--based uncertainty and segmentation error in 2D brain tumor MRI segmentation using a U-Net trained under four augmentation settings: none, horizontal flip, rotation, and scaling. Uncertainty was computed from 50 stochastic forward passes and correlated with pixel-wise errors using Pearson and Spearman coefficients. Results show weak global correlations ($r \approx 0.30$--$0.38$) and negligible boundary correlations ($|r| < 0.05$). Although differences across augmentations were statistically significant ($p < 0.001$), they lacked practical relevance. These findings suggest that MC Dropout uncertainty provides limited cues for boundary error localization, underscoring the need for alternative or hybrid uncertainty estimation methods in medical image segmentation.
comment: Code and results available at https://github.com/Saumya4321/mc-dropout-boundary-correlation
☆ Compressive Modeling and Visualization of Multivariate Scientific Data using Implicit Neural Representation
The extensive adoption of Deep Neural Networks has led to their increased utilization in challenging scientific visualization tasks. Recent advancements in building compressed data models using implicit neural representations have shown promising results for tasks like spatiotemporal volume visualization and super-resolution. Inspired by these successes, we develop compressed neural representations for multivariate datasets containing tens to hundreds of variables. Our approach utilizes a single network to learn representations for all data variables simultaneously through parameter sharing. This allows us to achieve state-of-the-art data compression. Through comprehensive evaluations, we demonstrate superior performance in terms of reconstructed data quality, rendering and visualization quality, preservation of dependency information among variables, and storage efficiency.
comment: Accepted for publication in 16th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2025)
☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
☆ Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models
Modern face recognition models achieve high overall accuracy but continue to exhibit systematic biases that disproportionately affect certain subpopulations. Conventional bias evaluation frameworks rely on labeled attributes to form subpopulations, which are expensive to obtain and limited to predefined categories. We introduce Latent Feature Alignment (LFA), an attribute-label-free algorithm that uses latent directions to identify subpopulations. This yields two main benefits over standard clustering: (i) semantically coherent grouping, where faces sharing common attributes are grouped together more reliably than by proximity-based methods, and (ii) discovery of interpretable directions, which correspond to semantic attributes such as age, ethnicity, or attire. Across four state-of-the-art recognition models (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence, while uncovering interpretable latent directions aligned with demographic and contextual attributes. These results position LFA as a practical method for representation auditing of face recognition models, enabling practitioners to identify and interpret biased subpopulations without predefined attribute annotations.
☆ Revisiting Knowledge Distillation: The Hidden Role of Dataset Size
The concept of knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size, and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale and relative number of samples on the observed phenomenon. Ultimately, this work reveals that the dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
☆ Language Models are Injective and Hence Invertible
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
☆ Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
In this study, we propose an enhancement to the similarity computation mechanism in multi-modal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metrics between paired modalities should correspond to the pointwise mutual information (PMI) between the two modalities. However, the current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that our approach overall outperforms the standard CLIP formulation across several retrieval and classification tasks.
☆ The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling
Reinforcement learning (RL) has been pivotal in enhancing the reasoning capabilities of large language models (LLMs), but it often suffers from limited exploration and entropy collapse, where models exploit a narrow set of solutions, leading to a loss of sampling diversity and subsequently preventing RL from further improving performance. This issue is exacerbated in parallel sampling methods, where multiple outputs are drawn from the same distribution, potentially causing the model to converge to similar solutions. We propose SESA, a novel SEquential SAmpling framework that mitigates this challenge by generating diverse solution sketches sequentially before expanding them into full reasoning paths. This approach ensures broader exploration by conditioning each new output on previous ones, promoting diversity throughout the process and preventing policy collapse. Our experiments on a synthetic task show that sequential sampling consistently outperforms traditional RL methods in terms of path diversity and recovery from collapse. Further evaluations on real-world tasks demonstrate that SESA improves both the exploration of valid strategies and the overall performance of LLMs. On three agent benchmarks, SESA lifts success rates by $+0.25$, $+0.42$, and $+0.07$ absolute over the base model (up to an additional $211\%$ relative improvement over baseline RL), underscoring its exploration advantage. This work introduces a structured approach to exploration, paving the way for more effective and diverse reasoning in RL-trained LLMs. Our code is released at https://github.com/MuLabPKU/sesa.
☆ DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.
comment: 28 pages, 17 figures, accepted by NeruIPS 2025
☆ OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning
Reinforcement learning algorithms typically utilize an interactive simulator (i.e., environment) with a predefined reward function for policy training. Developing such simulators and manually defining reward functions, however, is often time-consuming and labor-intensive. To address this, we propose an Offline Simulator (OffSim), a novel model-based offline inverse reinforcement learning (IRL) framework, to emulate environmental dynamics and reward structure directly from expert-generated state-action trajectories. OffSim jointly optimizes a high-entropy transition model and an IRL-based reward function to enhance exploration and improve the generalizability of the learned reward. Leveraging these learned components, OffSim can subsequently train a policy offline without further interaction with the real environment. Additionally, we introduce OffSim$^+$, an extension that incorporates a marginal reward for multi-dataset settings to enhance exploration. Extensive MuJoCo experiments demonstrate that OffSim achieves substantial performance gains over existing offline IRL methods, confirming its efficacy and robustness.
☆ AI and analytics in sports: Leveraging BERTopic to map the past and chart the future
Purpose: The purpose of this study is to map the body of scholarly literature at the intersection of artificial intelligence (AI), analytics and sports and thereafter, leverage the insights generated to chart guideposts for future research. Design/methodology/approach: The study carries out systematic literature review (SLR). Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) protocol is leveraged to identify 204 journal articles pertaining to utilization of AI and analytics in sports published during 2002 to 2024. We follow it up with extraction of the latent topics from sampled articles by leveraging the topic modelling technique of BERTopic. Findings: The study identifies the following as predominant areas of extant research on usage of AI and analytics in sports: performance modelling, physical and mental health, social media sentiment analysis, and tactical tracking. Each extracted topic is further examined in terms of its relative prominence, representative studies, and key term associations. Drawing on these insights, the study delineates promising avenues for future inquiry. Research limitations/implications: The study offers insights to academicians and sports administrators on transformational impact of AI and analytics in sports. Originality/value: The study introduces BERTopic as a novel approach for extracting latent structures in sports research, thereby advancing both scholarly understanding and the methodological toolkit of the field.
comment: 32 pages, 5 figures, 1 table, accepted for presentation at Australia and New Zealand Marketing Academy (ANZMAC) - 2025 Conference
☆ Online Policy Learning via a Self-Normalized Maximal Inequality
Adaptive experiments produce dependent data that break i.i.d. assumptions that underlie classical concentration bounds and invalidate standard learning guarantees. In this paper, we develop a self-normalized maximal inequality for martingale empirical processes. Building on this, we first propose an adaptive sample-variance penalization procedure which balances empirical loss and sample variance, valid for general dependent data. Next, this allows us to derive a new variance-regularized pessimistic off-policy learning objective, for which we establish excess-risk guarantees. Subsequently, we show that, when combined with sequential updates and under standard complexity and margin conditions, the resulting estimator achieves fast convergence rates in both parametric and nonparametric regimes, improving over the usual $1/\sqrt{n}$ baseline. We complement our theoretical findings with numerical simulations that illustrate the practical gains of our approach.
☆ Adversary-Free Counterfactual Prediction via Information-Regularized Representations
We study counterfactual prediction under assignment bias and propose a mathematically grounded, information-theoretic approach that removes treatment-covariate dependence without adversarial training. Starting from a bound that links the counterfactual-factual risk gap to mutual information, we learn a stochastic representation Z that is predictive of outcomes while minimizing I(Z; T). We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised decoder, yielding a stable, provably motivated training criterion. The framework extends naturally to dynamic settings by applying the information penalty to sequential representations at each decision time. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines. Across metrics of likelihood, counterfactual error, and policy evaluation, our approach performs favorably while avoiding the training instabilities and tuning burden of adversarial schemes.
☆ Learning to Answer from Correct Demonstrations
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.
comment: Comments are welcome
☆ Robust Optimization in Causal Models and G-Causal Normalizing Flows
In this paper, we show that interventionally robust optimization problems in causal models are continuous under the $G$-causal Wasserstein distance, but may be discontinuous under the standard Wasserstein distance. This highlights the importance of using generative models that respect the causal structure when augmenting data for such tasks. To this end, we propose a new normalizing flow architecture that satisfies a universal approximation property for causal structural models and can be efficiently trained to minimize the $G$-causal Wasserstein distance. Empirically, we demonstrate that our model outperforms standard (non-causal) generative models in data augmentation for causal regression and mean-variance portfolio optimization in causal factor models.
☆ Expediting Reinforcement Learning by Incorporating Knowledge About Temporal Causality in the Environment
Reinforcement learning (RL) algorithms struggle with learning optimal policies for tasks where reward feedback is sparse and depends on a complex sequence of events in the environment. Probabilistic reward machines (PRMs) are finite-state formalisms that can capture temporal dependencies in the reward signal, along with nondeterministic task outcomes. While special RL algorithms can exploit this finite-state structure to expedite learning, PRMs remain difficult to modify and design by hand. This hinders the already difficult tasks of utilizing high-level causal knowledge about the environment, and transferring the reward formalism into a new domain with a different causal structure. This paper proposes a novel method to incorporate causal information in the form of Temporal Logic-based Causal Diagrams into the reward formalism, thereby expediting policy learning and aiding the transfer of task specifications to new environments. Furthermore, we provide a theoretical result about convergence to optimal policy for our method, and demonstrate its strengths empirically.
comment: Please cite the proceedings version. Source code: https://github.com/corazza/tcrl
☆ Particle Dynamics for Latent-Variable Energy-Based Models
Latent-variable energy-based models (LVEBMs) assign a single normalized energy to joint pairs of observed data and latent variables, offering expressive generative modeling while capturing hidden structure. We recast maximum-likelihood training as a saddle problem over distributions on the latent and joint manifolds and view the inner updates as coupled Wasserstein gradient flows. The resulting algorithm alternates overdamped Langevin updates for a joint negative pool and for conditional latent particles with stochastic parameter ascent, requiring no discriminator or auxiliary networks. We prove existence and convergence under standard smoothness and dissipativity assumptions, with decay rates in KL divergence and Wasserstein-2 distance. The saddle-point view further yields an ELBO strictly tighter than bounds obtained with restricted amortized posteriors. Our method is evaluated on numerical approximations of physical systems and performs competitively against comparable approaches.
☆ A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning NeurIPS 2025
Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.
comment: Accepted by NeurIPS 2025
☆ Nonlinear Dimensionality Reduction Techniques for Bayesian Optimization
Bayesian optimisation (BO) is a standard approach for sample-efficient global optimisation of expensive black-box functions, yet its scalability to high dimensions remains challenging. Here, we investigate nonlinear dimensionality reduction techniques that reduce the problem to a sequence of low-dimensional Latent-Space BO (LSBO). While early LSBO methods used (linear) random projections (Wang et al., 2013), building on Grosnit et al. (2021), we employ Variational Autoencoders (VAEs) for LSBO, focusing on deep metric loss for structured latent manifolds and VAE retraining to adapt the encoder-decoder to newly sampled regions. We propose some changes in their implementation, originally designed for tasks such as molecule generation, and reformulate the algorithm for broader optimisation purposes. We then couple LSBO with Sequential Domain Reduction (SDR) directly in the latent space (SDR-LSBO), yielding an algorithm that narrows the latent search domains as evidence accumulates. Implemented in a GPU-accelerated BoTorch stack with Matern-5/2 Gaussian process surrogates, our numerical results show improved optimisation quality across benchmark tasks and that structured latent manifolds improve BO performance. Additionally, we compare random embeddings and VAEs as two mechanisms for dimensionality reduction, showing that the latter outperforms the former. To the best of our knowledge, this is the first study to combine SDR with VAE-based LSBO, and our analysis clarifies design choices for metric shaping and retraining that are critical for scalable latent space BO. For reproducibility, our source code is available at https://github.com/L-Lok/Nonlinear-Dimensionality-Reduction-Techniques-for-Bayesian-Optimization.git.
comment: 34 pages including appendixes, 8 figures. Keywords: global optimisation, dimensionality reduction techniques, Bayesian methods, Variational Autoencoders
☆ Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety
Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.
comment: 11 pages, 10 figures, The 8th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery (GeoAI '25), November 3--6, 2025, Minneapolis, MN, USA
☆ Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models
This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains - ranking and recommendation, and text-to-image diffusion models. The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss. The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability. The final part examines the trade-offs between efficiency and effectiveness in generative RL. A systematic study of PPO and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO's clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
comment: PhD Thesis of Shashank Gupta defended at the University of Amsterdam on October 13th 2025
☆ ParaFormer: Shallow Parallel Transformers with Progressive Approximation
The widespread 'deeper is better' philosophy has driven the creation of architectures like ResNet and Transformer, which achieve high performance by stacking numerous layers. However, increasing model depth comes with challenges such as longer training times, higher inference latency, and impracticality on resource-constrained devices. To address these issues, we propose ParaFormer, a shallow Transformer architecture designed for true parallelism in both structure and computation. By formulating standard Transformers as function approximators in closed-form, our theoretical analysis shows that their performance relies on inter-layer collaboration for progressive approximation, rather than depth itself. While deep Transformers enforce this collaboration through sequential designs, we demonstrate that such collaboration is not inherently tied to sequential structures. ParaFormer removes the sequential constraint by organizing layers into parallel branches, enforcing inter-layer collaboration algorithmically. Specifically, we implement progressive approximation, ensuring that each new branch further reduces the loss from preceding branches, enabling faster convergence. Extensive experiments validate ParaFormer's effectiveness, outperforming standard Transformers like ViT. Moreover, ParaFormer supports up to 15.07x model compression and facilitates model expansion for adaptive continuous learning. Experimental results on multi-GPU deployment demonstrate that ParaFormer is 3.30x faster than widely used parallelism solutions such as FairScale. These advancements stem from our closed-form formulation of Transformers based on the Universal Approximation Theorem, which not only explains the ``depth belief'' but also opens new avenues for designing efficient Transformer architectures. Source code: https://(open-upon-acceptance)
☆ Information Theory in Open-world Machine Learning Foundations, Frameworks, and Future Direction
Open world Machine Learning (OWML) aims to develop intelligent systems capable of recognizing known categories, rejecting unknown samples, and continually learning from novel information. Despite significant progress in open set recognition, novelty detection, and continual learning, the field still lacks a unified theoretical foundation that can quantify uncertainty, characterize information transfer, and explain learning adaptability in dynamic, nonstationary environments. This paper presents a comprehensive review of information theoretic approaches in open world machine learning, emphasizing how core concepts such as entropy, mutual information, and Kullback Leibler divergence provide a mathematical language for describing knowledge acquisition, uncertainty suppression, and risk control under open world conditions. We synthesize recent studies into three major research axes: information theoretic open set recognition enabling safe rejection of unknowns, information driven novelty discovery guiding new concept formation, and information retentive continual learning ensuring stable long term adaptation. Furthermore, we discuss theoretical connections between information theory and provable learning frameworks, including PAC Bayes bounds, open-space risk theory, and causal information flow, to establish a pathway toward provable and trustworthy open world intelligence. Finally, the review identifies key open problems and future research directions, such as the quantification of information risk, development of dynamic mutual information bounds, multimodal information fusion, and integration of information theory with causal reasoning and world model learning.
☆ Online Kernel Dynamic Mode Decomposition for Streaming Time Series Forecasting with Adaptive Windowing
Real-time forecasting from streaming data poses critical challenges: handling non-stationary dynamics, operating under strict computational limits, and adapting rapidly without catastrophic forgetting. However, many existing approaches face trade-offs between accuracy, adaptability, and efficiency, particularly when deployed in constrained computing environments. We introduce WORK-DMD (Windowed Online Random Kernel Dynamic Mode Decomposition), a method that combines Random Fourier Features with online Dynamic Mode Decomposition to capture nonlinear dynamics through explicit feature mapping, while preserving fixed computational cost and competitive predictive accuracy across evolving data. WORK-DMD employs Sherman-Morrison updates within rolling windows, enabling continuous adaptation to evolving dynamics from only current data, eliminating the need for lengthy training or large storage requirements for historical data. Experiments on benchmark datasets across several domains show that WORK-DMD achieves higher accuracy than several state-of-the-art online forecasting methods, while requiring only a single pass through the data and demonstrating particularly strong performance in short-term forecasting. Our results show that combining kernel evaluations with adaptive matrix updates achieves strong predictive performance with minimal data requirements. This sample efficiency offers a practical alternative to deep learning for streaming forecasting applications.
☆ Geometric Mixture Models for Electrolyte Conductivity Prediction
Accurate prediction of ionic conductivity in electrolyte systems is crucial for advancing numerous scientific and technological applications. While significant progress has been made, current research faces two fundamental challenges: (1) the lack of high-quality standardized benchmarks, and (2) inadequate modeling of geometric structure and intermolecular interactions in mixture systems. To address these limitations, we first reorganize and enhance the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph representations of molecules. We then propose GeoMix, a novel geometry-aware framework that preserves Set-SE(3) equivariance-an essential but challenging property for mixture systems. At the heart of GeoMix lies the Geometric Interaction Network (GIN), an equivariant module specifically designed for intermolecular geometric message passing. Comprehensive experiments demonstrate that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing for accurate property prediction. This work not only establishes new benchmarks for electrolyte research but also provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
☆ LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/
☆ Recursive Inference for Heterogeneous Multi-Output GP State-Space Models with Arbitrary Moment Matching
Accurate learning of system dynamics is becoming increasingly crucial for advanced control and decision-making in engineering. However, real-world systems often exhibit multiple channels and highly nonlinear transition dynamics, challenging traditional modeling methods. To enable online learning for these systems, this paper formulates the system as Gaussian process state-space models (GPSSMs) and develops a recursive learning method. The main contributions are threefold. First, a heterogeneous multi-output kernel is designed, allowing each output dimension to adopt distinct kernel types, hyperparameters, and input variables, improving expressiveness in multi-dimensional dynamics learning. Second, an inducing-point management algorithm enhances computational efficiency through independent selection and pruning for each output dimension. Third, a unified recursive inference framework for GPSSMs is derived, supporting general moment matching approaches, including the extended Kalman filter (EKF), unscented Kalman filter (UKF), and assumed density filtering (ADF), enabling accurate learning under strong nonlinearity and significant noise. Experiments on synthetic and real-world datasets show that the proposed method matches the accuracy of SOTA offline GPSSMs with only 1/100 of the runtime, and surpasses SOTA online GPSSMs by around 70% in accuracy under heavy noise while using only 1/20 of the runtime.
☆ Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning
While behavior cloning with flow/diffusion policies excels at learning complex skills from demonstrations, it remains vulnerable to distributional shift, and standard RL methods struggle to fine-tune these models due to their iterative inference process and the limitations of existing workarounds. In this work, we introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme inherently aligns it with the variational Jordan-Kinderlehrer-Otto (JKO) principle from optimal transport. SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions. Each step corresponds to a JKO update, regularizing policy changes to stay near the previous iterate and ensuring stable online adaptation with entropic regularization. This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages: simpler/faster training of sub-models, reduced computational/memory costs, and provable stability grounded in Wasserstein trust regions. Comprehensive experiments demonstrate SWFP's enhanced stability, efficiency, and superior adaptation performance across diverse robotic control benchmarks.
☆ Towards Robust Zero-Shot Reinforcement Learning
The recent development of zero-shot reinforcement learning (RL) has opened a new avenue for learning pre-trained generalist policies that can adapt to arbitrary new tasks in a zero-shot manner. While the popular Forward-Backward representations (FB) and related methods have shown promise in zero-shot RL, we empirically found that their modeling lacks expressivity and that extrapolation errors caused by out-of-distribution (OOD) actions during offline learning sometimes lead to biased representations, ultimately resulting in suboptimal performance. To address these issues, we propose Behavior-REgularizEd Zero-shot RL with Expressivity enhancement (BREEZE), an upgraded FB-based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality. BREEZE introduces behavioral regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm. Additionally, BREEZE extracts the policy using a task-conditioned diffusion model, enabling the generation of high-quality and multimodal action distributions in zero-shot RL settings. Moreover, BREEZE employs expressive attention-based architectures for representation modeling to capture the complex relationships between environmental dynamics. Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. The official implementation is available at: https://github.com/Whiterrrrr/BREEZE.
comment: Neurips 2025, 36 pages, 18 figures
☆ Towards Flash Thinking via Decoupled Advantage Policy Optimization
Recent Large Reasoning Models (LRMs) have achieved remarkable performance in solving complex problems via supervised fine-tuning (SFT) and reinforcement learning (RL). Although existing RL algorithms significantly enhance model accuracy, they still suffer from excessively lengthy responses and overthinking issues, resulting in increased inference latency and computational consumption, especially for simple tasks that require minimal reasoning. To address this, we propose a novel RL framework, DEPO, to reduce inefficient reasoning for models. Our method mainly consists of three core components: (1) an innovative advantage decoupled algorithm to guide model reduction of inefficient tokens; (2) a difficulty-aware length penalty to lower the overall length of model responses; (3) an advantage clipping method to prevent bias in policy optimization. In our experiments, applied to DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B as base models, DEPO achieves a significant reduction in sequence length by 39% and reduces excessive reasoning paths in inefficient tokens, while outperforming the base model in overall accuracy.
☆ Sequence Modeling with Spectral Mean Flows
A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets. Code is available at https://github.com/jw9730/spectral-mean-flow.
comment: 30 pages, 9 figures
☆ TranSimHub:A Unified Air-Ground Simulation Platform for Multi-Modal Perception and Decision-Making
Air-ground collaborative intelligence is becoming a key approach for next-generation urban intelligent transportation management, where aerial and ground systems work together on perception, communication, and decision-making. However, the lack of a unified multi-modal simulation environment has limited progress in studying cross-domain perception, coordination under communication constraints, and joint decision optimization. To address this gap, we present TranSimHub, a unified simulation platform for air-ground collaborative intelligence. TranSimHub offers synchronized multi-view rendering across RGB, depth, and semantic segmentation modalities, ensuring consistent perception between aerial and ground viewpoints. It also supports information exchange between the two domains and includes a causal scene editor that enables controllable scenario creation and counterfactual analysis under diverse conditions such as different weather, emergency events, and dynamic obstacles. We release TranSimHub as an open-source platform that supports end-to-end research on perception, fusion, and control across realistic air and ground traffic scenes. Our code is available at https://github.com/Traffic-Alpha/TranSimHub.
comment: 9 pages, 4 figures
Kernel Regression in Structured Non-IID Settings: Theory and Implications for Denoising Score Learning
Kernel ridge regression (KRR) is a foundational tool in machine learning, with recent work emphasizing its connections to neural networks. However, existing theory primarily addresses the i.i.d. setting, while real-world data often exhibits structured dependencies - particularly in applications like denoising score learning where multiple noisy observations derive from shared underlying signals. We present the first systematic study of KRR generalization for non-i.i.d. data with signal-noise causal structure, where observations represent different noisy views of common signals. By developing a novel blockwise decomposition method that enables precise concentration analysis for dependent data, we derive excess risk bounds for KRR that explicitly depend on: (1) the kernel spectrum, (2) causal structure parameters, and (3) sampling mechanisms (including relative sample sizes for signals and noises). We further apply our results to denoising score learning, establishing generalization guarantees and providing principled guidance for sampling noisy data points. This work advances KRR theory while providing practical tools for analyzing dependent data in modern machine learning applications.
☆ RankSEG-RMA: An Efficient Segmentation Algorithm via Reciprocal Moment Approximation
Semantic segmentation labels each pixel in an image with its corresponding class, and is typically evaluated using the Intersection over Union (IoU) and Dice metrics to quantify the overlap between predicted and ground-truth segmentation masks. In the literature, most existing methods estimate pixel-wise class probabilities, then apply argmax or thresholding to obtain the final prediction. These methods have been shown to generally lead to inconsistent or suboptimal results, as they do not directly maximize segmentation metrics. To address this issue, a novel consistent segmentation framework, RankSEG, has been proposed, which includes RankDice and RankIoU specifically designed to optimize the Dice and IoU metrics, respectively. Although RankSEG almost guarantees improved performance, it suffers from two major drawbacks. First, it is its computational expense-RankDice has a complexity of O(d log d) with a substantial constant factor (where d represents the number of pixels), while RankIoU exhibits even higher complexity O(d^2), thus limiting its practical application. For instance, in LiTS, prediction with RankSEG takes 16.33 seconds compared to just 0.01 seconds with the argmax rule. Second, RankSEG is only applicable to overlapping segmentation settings, where multiple classes can occupy the same pixel, which contrasts with standard benchmarks that typically assume non-overlapping segmentation. In this paper, we overcome these two drawbacks via a reciprocal moment approximation (RMA) of RankSEG with the following contributions: (i) we improve RankSEG using RMA, namely RankSEG-RMA, reduces the complexity of both algorithms to O(d) while maintaining comparable performance; (ii) inspired by RMA, we develop a pixel-wise score function that allows efficient implementation for non-overlapping segmentation settings.
☆ Singularity-free dynamical invariants-based quantum control
State preparation is a cornerstone of quantum technologies, underpinning applications in computation, communication, and sensing. Its importance becomes even more pronounced in non-Markovian open quantum systems, where environmental memory and model uncertainties pose significant challenges to achieving high-fidelity control. Invariant-based inverse engineering provides a principled framework for synthesizing analytic control fields, yet existing parameterizations often lead to experimentally infeasible, singular pulses and are limited to simplified noise models such as those of Lindblad form. Here, we introduce a generalized invariant-based protocol for single-qubit state preparation under arbitrary noise conditions. The control proceeds in two-stages: first, we construct a family of bounded pulses that achieve perfect state preparation in a closed system; second, we identify the optimal member of this family that minimizes the effect of noise. The framework accommodates both (i) characterized noise, enabling noise-aware control synthesis, and (ii) uncharacterized noise, where a noise-agnostic variant preserves robustness without requiring a master-equation description. Numerical simulations demonstrate high-fidelity state preparation across diverse targets while producing smooth, hardware-feasible control fields. This singularity-free framework extends invariant-based control to realistic open-system regimes, providing a versatile route toward robust quantum state engineering on NISQ hardware and other platforms exhibiting non-Markovian dynamics.
☆ Transfer Learning for Benign Overfitting in High-Dimensional Linear Regression NeurIPS 2025
Transfer learning is a key component of modern machine learning, enhancing the performance of target tasks by leveraging diverse data sources. Simultaneously, overparameterized models such as the minimum-$\ell_2$-norm interpolator (MNI) in high-dimensional linear regression have garnered significant attention for their remarkable generalization capabilities, a property known as benign overfitting. Despite their individual importance, the intersection of transfer learning and MNI remains largely unexplored. Our research bridges this gap by proposing a novel two-step Transfer MNI approach and analyzing its trade-offs. We characterize its non-asymptotic excess risk and identify conditions under which it outperforms the target-only MNI. Our analysis reveals free-lunch covariate shift regimes, where leveraging heterogeneous data yields the benefit of knowledge transfer at limited cost. To operationalize our findings, we develop a data-driven procedure to detect informative sources and introduce an ensemble method incorporating multiple informative Transfer MNIs. Finite-sample experiments demonstrate the robustness of our methods to model and data heterogeneity, confirming their advantage.
comment: 42 pages, 4 figures, 2 tables, 1 algorithm; camera-ready version accepted at NeurIPS 2025 (Spotlight)
☆ Backdoor or Manipulation? Graph Mixture of Experts Can Defend Against Various Graph Adversarial Attacks
Extensive research has highlighted the vulnerability of graph neural networks (GNNs) to adversarial attacks, including manipulation, node injection, and the recently emerging threat of backdoor attacks. However, existing defenses typically focus on a single type of attack, lacking a unified approach to simultaneously defend against multiple threats. In this work, we leverage the flexibility of the Mixture of Experts (MoE) architecture to design a scalable and unified framework for defending against backdoor, edge manipulation, and node injection attacks. Specifically, we propose an MI-based logic diversity loss to encourage individual experts to focus on distinct neighborhood structures in their decision processes, thus ensuring a sufficient subset of experts remains unaffected under perturbations in local structures. Moreover, we introduce a robustness-aware router that identifies perturbation patterns and adaptively routes perturbed nodes to corresponding robust experts. Extensive experiments conducted under various adversarial settings demonstrate that our method consistently achieves superior robustness against multiple graph adversarial attacks.
☆ On the Generalization Properties of Learning the Random Feature Models with Learnable Activation Functions
This paper studies the generalization properties of a recently proposed kernel method, the Random Feature models with Learnable Activation Functions (RFLAF). By applying a data-dependent sampling scheme for generating features, we provide by far the sharpest bounds on the required number of features for learning RFLAF in both the regression and classification tasks. We provide a unified theorem that describes the complexity of the feature number $s$, and discuss the results for the plain sampling scheme and the data-dependent leverage weighted scheme. Through weighted sampling, the bound on $s$ in the MSE loss case is improved from $\Omega(1/\epsilon^2)$ to $\tilde{\Omega}((1/\epsilon)^{1/t})$ in general $(t\geq 1)$, and even to $\Omega(1)$ when the Gram matrix has a finite rank. For the Lipschitz loss case, the bound is improved from $\Omega(1/\epsilon^2)$ to $\tilde{\Omega}((1/\epsilon^2)^{1/t})$. To learn the weighted RFLAF, we also propose an algorithm to find an approximate kernel and then apply the leverage weighted sampling. Empirical results show that the weighted RFLAF achieves the same performances with a significantly fewer number of features compared to the plainly sampled RFLAF, validating our theories and the effectiveness of this method.
☆ Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
☆ DFCA: Decentralized Federated Clustering Algorithm
Clustered Federated Learning has emerged as an effective approach for handling heterogeneous data across clients by partitioning them into clusters with similar or identical data distributions. However, most existing methods, including the Iterative Federated Clustering Algorithm (IFCA), rely on a central server to coordinate model updates, which creates a bottleneck and a single point of failure, limiting their applicability in more realistic decentralized learning settings. In this work, we introduce DFCA, a fully decentralized clustered FL algorithm that enables clients to collaboratively train cluster-specific models without central coordination. DFCA uses a sequential running average to aggregate models from neighbors as updates arrive, providing a communication-efficient alternative to batch aggregation while maintaining clustering performance. Our experiments on various datasets demonstrate that DFCA outperforms other decentralized algorithms and performs comparably to centralized IFCA, even under sparse connectivity, highlighting its robustness and practicality for dynamic real-world decentralized networks.
☆ Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning ICDM
Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. While existing methods implicitly model label relationships through distance-based similarity, lacking explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.
comment: 8 pages, ICDM Workshop
☆ Identifying internal patterns in (1+1)-dimensional directed percolation using neural networks
In this paper we present a neural network-based method for the automatic detection of phase transitions and classification of hidden percolation patterns in a (1+1)-dimensional replication process. The proposed network model is based on the combination of CNN, TCN and GRU networks, which are trained directly on raw configurations without any manual feature extraction. The network reproduces the phase diagram and assigns phase labels to configurations. It shows that deep architectures are capable of extracting hierarchical structures from the raw data of numerical experiments.
comment: 7 pages, 10 figures, 2 tables
☆ Small Ensemble-based Data Assimilation: A Machine Learning-Enhanced Data Assimilation Method with Limited Ensemble Size
Ensemble-based data assimilation (DA) methods have become increasingly popular due to their inherent ability to address nonlinear dynamic problems. However, these methods often face a trade-off between analysis accuracy and computational efficiency, as larger ensemble sizes required for higher accuracy also lead to greater computational cost. In this study, we propose a novel machine learning-based data assimilation approach that combines the traditional ensemble Kalman filter (EnKF) with a fully connected neural network (FCNN). Specifically, our method uses a relatively small ensemble size to generate preliminary yet suboptimal analysis states via EnKF. A FCNN is then employed to learn and predict correction terms for these states, thereby mitigating the performance degradation induced by the limited ensemble size. We evaluate the performance of our proposed EnKF-FCNN method through numerical experiments involving Lorenz systems and nonlinear ocean wave field simulations. The results consistently demonstrate that the new method achieves higher accuracy than traditional EnKF with the same ensemble size, while incurring negligible additional computational cost. Moreover, the EnKF-FCNN method is adaptable to diverse applications through coupling with different models and the use of alternative ensemble-based DA methods.
☆ Post-Processing Methods for Improving Accuracy in MRI Inpainting
Magnetic Resonance Imaging (MRI) is the primary imaging modality used in the diagnosis, assessment, and treatment planning for brain pathologies. However, most automated MRI analysis tools, such as segmentation and registration pipelines, are optimized for healthy anatomies and often fail when confronted with large lesions such as tumors. To overcome this, image inpainting techniques aim to locally synthesize healthy brain tissues in tumor regions, enabling the reliable application of general-purpose tools. In this work, we systematically evaluate state-of-the-art inpainting models and observe a saturation in their standalone performance. In response, we introduce a methodology combining model ensembling with efficient post-processing strategies such as median filtering, histogram matching, and pixel averaging. Further anatomical refinement is achieved via a lightweight U-Net enhancement stage. Comprehensive evaluation demonstrates that our proposed pipeline improves the anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models. By combining established models with targeted post-processing, we achieve improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable, resource-conscious research. Our 2025 BraTS inpainting docker is available at https://hub.docker.com/layers/aparida12/brats2025/inpt.
☆ Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition NeurIPS 2025
Foundation models (FMs), such as GPT-4 and AlphaFold, are reshaping the landscape of scientific research. Beyond accelerating tasks such as hypothesis generation, experimental design, and result interpretation, they prompt a more fundamental question: Are FMs merely enhancing existing scientific methodologies, or are they redefining the way science is conducted? In this paper, we argue that FMs are catalyzing a transition toward a new scientific paradigm. We introduce a three-stage framework to describe this evolution: (1) Meta-Scientific Integration, where FMs enhance workflows within traditional paradigms; (2) Hybrid Human-AI Co-Creation, where FMs become active collaborators in problem formulation, reasoning, and discovery; and (3) Autonomous Scientific Discovery, where FMs operate as independent agents capable of generating new scientific knowledge with minimal human intervention. Through this lens, we review current applications and emerging capabilities of FMs across existing scientific paradigms. We further identify risks and future directions for FM-enabled scientific discovery. This position paper aims to support the scientific community in understanding the transformative role of FMs and to foster reflection on the future of scientific discovery. Our project is available at https://github.com/usail-hkust/Awesome-Foundation-Models-for-Scientific-Discovery.
comment: NeurIPS 2025
☆ Foresighted Online Policy Optimization with Interference
Contextual bandits, which leverage the baseline features of sequentially arriving individuals to optimize cumulative rewards while balancing exploration and exploitation, are critical for online decision-making. Existing approaches typically assume no interference, where each individual's action affects only their own reward. Yet, such an assumption can be violated in many practical scenarios, and the oversight of interference can lead to short-sighted policies that focus solely on maximizing the immediate outcomes for individuals, which further results in suboptimal decisions and potentially increased regret over time. To address this significant gap, we introduce the foresighted online policy with interference (FRONT) that innovatively considers the long-term impact of the current decision on subsequent decisions and rewards. The proposed FRONT method employs a sequence of exploratory and exploitative strategies to manage the intricacies of interference, ensuring robust parameter inference and regret minimization. Theoretically, we establish a tail bound for the online estimator and derive the asymptotic distribution of the parameters of interest under suitable conditions on the interference network. We further show that FRONT attains sublinear regret under two distinct definitions, capturing both the immediate and consequential impacts of decisions, and we establish these results with and without statistical inference. The effectiveness of FRONT is further demonstrated through extensive simulations and a real-world application to urban hotel profits.
Semi-Supervised Regression with Heteroscedastic Pseudo-Labels NeurIPS 2025
Pseudo-labeling is a commonly used paradigm in semi-supervised learning, yet its application to semi-supervised regression (SSR) remains relatively under-explored. Unlike classification, where pseudo-labels are discrete and confidence-based filtering is effective, SSR involves continuous outputs with heteroscedastic noise, making it challenging to assess pseudo-label reliability. As a result, naive pseudo-labeling can lead to error accumulation and overfitting to incorrect labels. To address this, we propose an uncertainty-aware pseudo-labeling framework that dynamically adjusts pseudo-label influence from a bi-level optimization perspective. By jointly minimizing empirical risk over all data and optimizing uncertainty estimates to enhance generalization on labeled data, our method effectively mitigates the impact of unreliable pseudo-labels. We provide theoretical insights and extensive experiments to validate our approach across various benchmark SSR datasets, and the results demonstrate superior robustness and performance compared to existing methods. Our code is available at https://github.com/sxq/Heteroscedastic-Pseudo-Labels.
comment: Accepted by NeurIPS 2025
☆ Causal Time Series Modeling of Supraglacial Lake Evolution in Greenland under Distribution Shift ICML
Causal modeling offers a principled foundation for uncovering stable, invariant relationships in time-series data, thereby improving robustness and generalization under distribution shifts. Yet its potential is underutilized in spatiotemporal Earth observation, where models often depend on purely correlational features that fail to transfer across heterogeneous domains. We propose RIC-TSC, a regionally-informed causal time-series classification framework that embeds lag-aware causal discovery directly into sequence modeling, enabling both predictive accuracy and scientific interpretability. Using multi-modal satellite and reanalysis data-including Sentinel-1 microwave backscatter, Sentinel-2 and Landsat-8 optical reflectance, and CARRA meteorological variables-we leverage Joint PCMCI+ (J-PCMCI+) to identify region-specific and invariant predictors of supraglacial lake evolution in Greenland. Causal graphs are estimated globally and per basin, with validated predictors and their time lags supplied to lightweight classifiers. On a balanced benchmark of 1000 manually labeled lakes from two contrasting melt seasons (2018-2019), causal models achieve up to 12.59% higher accuracy than correlation-based baselines under out-of-distribution evaluation. These results show that causal discovery is not only a means of feature selection but also a pathway to generalizable and mechanistically grounded models of dynamic Earth surface processes.
comment: Accepted as full paper in ICMLA 2025 (Special Session 1: Deep Learning and Applications)
☆ Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($\mu$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $\mu$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the $\mu$P learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2\propto \sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields \emph{zero-shot} transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $\mu$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
☆ DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models ICLR 2026
Large language models are highly sensitive to prompt wording. However, popular automatic prompt search methods, including InstructZero, often degrade under distribution shift and adversarial evaluation because they optimize expected performance under a single evaluation distribution. Consequently, prompts that work in one setting frequently fail to transfer. To address this, DRO-InstructZero formulates zero-shot prompt optimization as robust Bayesian optimization. Specifically, an f-divergence ball defines an ambiguity set around the evaluation distribution, and a robust acquisition rule maximizes worst-case expected utility while retaining the query efficiency of Bayesian search. Therefore, the search explicitly targets reliability under distribution shift rather than average behavior alone. Experiments follow the instruction-induction protocol with matched query budgets across formality rewriting, code debugging, and translation. For example, on BIG-Bench informative-to-formal rewriting, accuracy improves from 61.3 +/- 0.7% to approximately 85-90%, yielding an absolute gain of about 25-30 points. Moreover, auto-debugging shows about +25-point gains under domain shift. Meanwhile, stable tasks such as cause-and-effect remain above 96%, indicating no loss on in-distribution cases. Furthermore, improvements are consistent across divergence choices and decoding temperatures. Overall, DRO-InstructZero connects distributionally robust optimization with prompt learning, offering a plug-and-play and general approach for reliable, transferable prompt alignment under real-world uncertainty.
comment: Preprint. Under review at ICLR 2026. 11 pages, 2 figures
☆ Minimisation of Submodular Functions Using Gaussian Zeroth-Order Random Oracles
We consider the minimisation problem of submodular functions and investigate the application of a zeroth-order method to this problem. The method is based on exploiting a Gaussian smoothing random oracle to estimate the smoothed function gradient. We prove the convergence of the algorithm to a global $\epsilon$-approximate solution in the offline case and show that the algorithm is Hannan-consistent in the online case with respect to static regret. Moreover, we show that the algorithm achieves $O(\sqrt{NP_N^\ast})$ dynamic regret, where $N$ is the number of iterations and $P_N^\ast$ is the path length. The complexity analysis and hyperparameter selection are presented for all the cases. The theoretical results are illustrated via numerical examples.
☆ Spatiotemporal Transformers for Predicting Avian Disease Risk from Migration Trajectories
Accurate forecasting of avian disease outbreaks is critical for wildlife conservation and public health. This study presents a Transformer-based framework for predicting the disease risk at the terminal locations of migratory bird trajectories. We integrate multi-source datasets, including GPS tracking data from Movebank, outbreak records from the World Organisation for Animal Health (WOAH), and geospatial context from GADM and Natural Earth. The raw coordinates are processed using H3 hierarchical geospatial encoding to capture spatial patterns. The model learns spatiotemporal dependencies from bird movement sequences to estimate endpoint disease risk. Evaluation on a held-out test set demonstrates strong predictive performance, achieving an accuracy of 0.9821, area under the ROC curve (AUC) of 0.9803, average precision (AP) of 0.9299, and an F1-score of 0.8836 at the optimal threshold. These results highlight the potential of Transformer architectures to support early-warning systems for avian disease surveillance, enabling timely intervention and prevention strategies.
☆ Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning
Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM's embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM --> ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.
comment: Under Submission
☆ Dual-Weighted Reinforcement Learning for Generative Preference Modeling
Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models on tasks with verifiable answers. However, extending RL to more general non-verifiable tasks, typically in the format of human preference pairs, remains both challenging and underexplored. In this work, we propose Dual-Weighted Reinforcement Learning (DWRL), a new framework for preference modeling that integrates CoT reasoning with the Bradley-Terry (BT) model via a dual-weighted RL objective that preserves preference-modeling inductive bias. DWRL approximates the maximum-likelihood objective of the BT model with two complementary weights: an instance-wise misalignment weight, which emphasizes under-trained pairs misaligned with human preference, and a group-wise (self-normalized) conditional preference score, which promotes promising thoughts. In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score. Across multiple benchmarks and model scales (Llama3 and Qwen2.5), DWRL consistently outperforms both GPM baselines and scalar models, while producing coherent, interpretable thoughts. In summary, our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.
☆ HOB: A Holistically Optimized Bidding Strategy under Heterogeneous Auction Mechanisms with Organic Traffic
The E-commerce advertising platforms typically sell commercial traffic through either second-price auction (SPA) or first-price auction (FPA). SPA was historically prevalent due to its dominant strategy incentive-compatible (DSIC) for bidders with quasi-linear utilities, especially when budgets are not a binding constraint, while FPA has gained more prominence for offering higher revenue potential to publishers and avoiding the possibility for discriminatory treatment in personalized reserve prices. Meanwhile, on the demand side, advertisers are increasingly adopting platform-wide marketing solutions akin to QuanZhanTui, shifting from spending budgets solely on commercial traffic to bidding on the entire traffic for the purpose of maximizing overall sales. For automated bidding systems, such a trend poses a critical challenge: determining optimal strategies across heterogeneous auction channels to fulfill diverse advertiser objectives, such as maximizing return (MaxReturn) or meeting target return on ad spend (TargetROAS). To overcome this challenge, this work makes two key contributions. First, we derive an efficient solution for optimal bidding under FPA channels, which takes into account the presence of organic traffic - traffic can be won for free. Second, we introduce a marginal cost alignment (MCA) strategy that provably secures bidding efficiency across heterogeneous auction mechanisms. To validate performance of our developed framework, we conduct comprehensive offline experiments on public datasets and large-scale online A/B testing, which demonstrate consistent improvements over existing methods.
☆ Adaptive Individual Uncertainty under Out-Of-Distribution Shift with Expert-Routed Conformal Prediction
Reliable, informative, and individual uncertainty quantification (UQ) remains missing in current ML community. This hinders the effective application of AI/ML to risk-sensitive domains. Most methods either fail to provide coverage on new data, inflate intervals so broadly that they are not actionable, or assign uncertainties that do not track actual error, especially under a distribution shift. In high-stakes drug discovery, protein-ligand affinity (PLI) prediction is especially challenging as assay noise is heterogeneous, chemical space is imbalanced and large, and practical evaluations routinely involve distribution shift. In this work, we introduce a novel uncertainty quantification method, Trustworthy Expert Split-conformal with Scaled Estimation for Efficient Reliable Adaptive intervals (TESSERA), that provides per-sample uncertainty with reliable coverage guarantee, informative and adaptive prediction interval widths that track the absolute error. We evaluate on protein-ligand binding affinity prediction under both independent and identically distributed (i.i.d.) and scaffold-based out-of-distribution (OOD) splits, comparing against strong UQ baselines. TESSERA attains near-nominal coverage and the best coverage-width trade-off as measured by the Coverage-Width Criterion (CWC), while maintaining competitive adaptivity (lowest Area Under the Sparsification Error (AUSE)). Size-Stratified Coverage (SSC) further confirms that intervals are right-sized, indicating width increases when data are scarce or noisy, and remain tight when predictions are reliable. By unifying Mixture of Expert (MoE) diversity with conformal calibration, TESSERA delivers trustworthy, tight, and adaptive uncertainties that are well-suited to selective prediction and downstream decision-making in the drug-discovery pipeline and other applications.
☆ FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
Recent LLMs have demonstrated promising ability in solving finance related problems. However, applying LLMs in real-world finance application remains challenging due to its high risk and high stakes property. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperforms in most tasks such as safety while open-source models like DeepSeek-V3 have advantage in specific areas like industry-level fairness. For challenging task like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for LLMs' trustworthiness evaluation in finance domain.
comment: EMNLP 2025 Main
☆ Stress-Aware Learning under KL Drift via Trust-Decayed Mirror Descent
We study sequential decision-making under distribution drift. We propose entropy-regularized trust-decay, which injects stress-aware exponential tilting into both belief updates and mirror-descent decisions. On the simplex, a Fenchel-dual equivalence shows that belief tilt and decision tilt coincide. We formalize robustness via fragility (worst-case excess risk in a KL ball), belief bandwidth (radius sustaining a target excess), and a decision-space Fragility Index (drift tolerated at $O(\sqrt{T})$ regret). We prove high-probability sensitivity bounds and establish dynamic-regret guarantees of $\tilde{O}(\sqrt{T})$ under KL-drift path length $S_T = \sum_{t\ge2}\sqrt{{\rm KL}(D_t|D_{t-1})/2}$. In particular, trust-decay achieves $O(1)$ per-switch regret, while stress-free updates incur $\Omega(1)$ tails. A parameter-free hedge adapts the tilt to unknown drift, whereas persistent over-tilting yields an $\Omega(\lambda^2 T)$ stationary penalty. We further obtain calibrated-stress bounds and extensions to second-order updates, bandit feedback, outliers, stress variation, distributed optimization, and plug-in KL-drift estimation. The framework unifies dynamic-regret analysis, distributionally robust objectives, and KL-regularized control within a single stress-adaptive update.
☆ WELD: A Large-Scale Longitudinal Dataset of Emotional Dynamics for Ubiquitous Affective Computing
Automated emotion recognition in real-world workplace settings remains a challenging problem in affective computing due to the scarcity of large-scale, longitudinal datasets collected in naturalistic environments. We present a novel dataset comprising 733,651 facial expression records from 38 employees collected over 30.5 months (November 2021 to May 2024) in an authentic office environment. Each record contains seven emotion probabilities (neutral, happy, sad, surprised, fear, disgusted, angry) derived from deep learning-based facial expression recognition, along with comprehensive metadata including job roles, employment outcomes, and personality traits. The dataset uniquely spans the COVID-19 pandemic period, capturing emotional responses to major societal events including the Shanghai lockdown and policy changes. We provide 32 extended emotional metrics computed using established affective science methods, including valence, arousal, volatility, predictability, inertia, and emotional contagion strength. Technical validation demonstrates high data quality through successful replication of known psychological patterns (weekend effect: +192% valence improvement, p < 0.001; diurnal rhythm validated) and perfect predictive validity for employee turnover (AUC=1.0). Baseline experiments using Random Forest and LSTM models achieve 91.2% accuracy for emotion classification and R2 = 0.84 for valence prediction. This is the largest and longest longitudinal workplace emotion dataset publicly available, enabling research in emotion recognition, affective dynamics modeling, emotional contagion, turnover prediction, and emotion-aware system design.
comment: 15 pages, 4 figures, 1 table. Dataset publicly available under CC BY 4.0 license
☆ Integrating Product Coefficients for Improved 3D LiDAR Data Classification (Part II)
This work extends our previous study on enhancing 3D LiDAR point-cloud classification with product coefficients \cite{medina2025integratingproductcoefficientsimproved}, measure-theoretic descriptors that complement the original spatial Lidar features. Here, we show that combining product coefficients with an autoencoder representation and a KNN classifier delivers consistent performance gains over both PCA-based baselines and our earlier framework. We also investigate the effect of adding product coefficients level by level, revealing a clear trend: richer sets of coefficients systematically improve class separability and overall accuracy. The results highlight the value of combining hierarchical product-coefficient features with autoencoders to push LiDAR classification performance further.
comment: 16 pages, 6 figures, 5 tables
☆ Machine Learning for Early Detection of Meningitis: Stacked Ensemble Learning with EHR data
We utilized a cohort of 214 meningitis patients and 46,303 non-meningitis patients from the MIMIC-III database. After extensive data preprocessing, which included ICD-based cohort selection, one-hot encoding of coding, and a two-stage feature selection process (for both the training set and the testing sets), clinically relevant features such as gender and high-risk ICD codes (including subarachnoid hemorrhage, secondary malignant neoplasm of the brain, and generalized epilepsy) are selected. Overall, these clinically reasonable and temporally adherent features provided excellent modeling performance. Three models (Random Forest, LightGBM, and Deep Neural Networks (DNN) are trained as base models for Ensemble Learning. Base model outputs are aggregated and stacked into a meta model (Logistic Regression) that uses the base model outputs as input values in training. Ultimately, soldier outputs (AUC of Testing Set 1: 0.9637, AUC of Testing Set 2: 0.9472) are obtained through ensemble learning. We created a challenging condition for diagnosing meningitis, simulating a real-world ER (Emergency Room) scenario to enhance clinical use in real-world applications. While directly deploying a diagnostic tool that clinicians can use is challenging, this paper paves the way for a potential future AI-driven diagnostic approach for meningitis using Ensemble Learning.
☆ Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025
The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year's program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of "Explainability, Interpretability, and Transparency," "Uncertainty, Bias, and Fairness," "Causality," "Domain Adaptation," "Foundation Models," "Learning from Small Medical Data," "Multimodal Methods," and "Scalable, Translational Healthcare Solutions."
☆ Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses ("if-then" rules) built from features extracted from the LLM's latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules' soundness levels, becoming highly distinct for "strict" versus "noisy" rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL's predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model's reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model's internal mechanisms for selecting/designing stronger base models.
comment: Pre-print
☆ How to Sell High-Dimensional Data Optimally
Motivated by the problem of selling large, proprietary data, we consider an information pricing problem proposed by Bergemann et al. that involves a decision-making buyer and a monopolistic seller. The seller has access to the underlying state of the world that determines the utility of the various actions the buyer may take. Since the buyer gains greater utility through better decisions resulting from more accurate assessments of the state, the seller can therefore promise the buyer supplemental information at a price. To contend with the fact that the seller may not be perfectly informed about the buyer's private preferences (or utility), we frame the problem of designing a data product as one where the seller designs a revenue-maximizing menu of statistical experiments. Prior work by Cai et al. showed that an optimal menu can be found in time polynomial in the state space, whereas we observe that the state space is naturally exponential in the dimension of the data. We propose an algorithm which, given only sampling access to the state space, provably generates a near-optimal menu with a number of samples independent of the state space. We then analyze a special case of high-dimensional Gaussian data, showing that (a) it suffices to consider scalar Gaussian experiments, (b) the optimal menu of such experiments can be found efficiently via a semidefinite program, and (c) full surplus extraction occurs if and only if a natural separation condition holds on the set of potential preferences of the buyer.
☆ ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
The ability of large language models (LLMs) to follow user instructions is central to their reliability, safety, and usefulness. While prior studies assess instruction adherence in the model's main responses, we argue that it is also critical for large reasoning models (LRMs) to follow user instructions throughout their reasoning process. Reasoning instruction following makes LRMs more controllable and transparent, while reducing risks of undesirable shortcuts, hallucinations, or reward hacking within reasoning traces. To evaluate this dimension, we introduce ReasonIF, a systematic benchmark for assessing reasoning instruction following. ReasonIF includes six categories of instruction prompts, spanning multilingual reasoning, formatting and length control. Across many open-source LRMs including GPT-OSS, Qwen3, and DeepSeek-R1, we find substantial failures in reasoning instruction adherence: the highest instruction following score (IFS) remains below 0.25, meaning that fewer than $25\%$ of reasoning traces comply with the given instructions. Notably, as task difficulty increases, reasoning instruction following degrades further. We also explore two strategies to enhance reasoning instruction fidelity. (1) multi-turn reasoning and (2) Reasoning Instruction Finetuning (RIF) using synthetic data. RIF improves the IFS of $GPT-OSS-20B$ from 0.11 to 0.27, indicating measurable progress but leaving ample room for improvement.
☆ Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. hile Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren't universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model's OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
☆ Automotive Crash Dynamics Modeling Accelerated with Machine Learning
Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
☆ Small Ensemble-based Data Assimilation: A Machine Learning-Enhanced Data Assimilation Method with Limited Ensemble Size
Ensemble-based data assimilation (DA) methods have become increasingly popular due to their inherent ability to address nonlinear dynamic problems. However, these methods often face a trade-off between analysis accuracy and computational efficiency, as larger ensemble sizes required for higher accuracy also lead to greater computational cost. In this study, we propose a novel machine learning-based data assimilation approach that combines the traditional ensemble Kalman filter (EnKF) with a fully connected neural network (FCNN). Specifically, our method uses a relatively small ensemble size to generate preliminary yet suboptimal analysis states via EnKF. A FCNN is then employed to learn and predict correction terms for these states, thereby mitigating the performance degradation induced by the limited ensemble size. We evaluate the performance of our proposed EnKF-FCNN method through numerical experiments involving Lorenz systems and nonlinear ocean wave field simulations. The results consistently demonstrate that the new method achieves higher accuracy than traditional EnKF with the same ensemble size, while incurring negligible additional computational cost. Moreover, the EnKF-FCNN method is adaptable to diverse applications through coupling with different models and the use of alternative ensemble-based DA methods.
♻ ☆ REX: Causal discovery based on machine learning and explainability techniques
Explainable Artificial Intelligence (XAI) techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive the causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce ReX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that ReX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, ReX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taking together, these results showcase ReX's effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, ReX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures.
comment: 22 pages, 30 figures, Published in Elsevier's Pattern Recognition
♻ ☆ Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval
Associative memory models, such as Hopfield networks and their modern variants, have garnered renewed interest due to advancements in memory capacity and connections with self-attention in transformers. In this work, we introduce a unified framework-Hopfield-Fenchel-Young networks-which generalizes these models to a broader family of energy functions. Our energies are formulated as the difference between two Fenchel-Young losses: one, parameterized by a generalized entropy, defines the Hopfield scoring mechanism, while the other applies a post-transformation to the Hopfield output. By utilizing Tsallis and norm entropies, we derive end-to-end differentiable update rules that enable sparse transformations, uncovering new connections between loss margins, sparsity, and exact retrieval of single memory patterns. We further extend this framework to structured Hopfield networks using the SparseMAP transformation, allowing the retrieval of pattern associations rather than a single pattern. Our framework unifies and extends traditional and modern Hopfield networks and provides an energy minimization perspective for widely used post-transformations like $\ell_2$-normalization and layer normalization-all through suitable choices of Fenchel-Young losses and by using convex analysis as a building block. Finally, we validate our Hopfield-Fenchel-Young networks on diverse memory recall tasks, including free and sequential recall. Experiments on simulated data, image retrieval, multiple instance learning, and text rationalization demonstrate the effectiveness of our approach.
comment: 51 pages, 14 figures. arXiv admin note: text overlap with arXiv:2402.13725
♻ ☆ State Your Intention to Steer Your Attention: An AI Assistant for Intentional Digital Living
When working on digital devices, people often face distractions that can lead to a decline in productivity and efficiency, as well as negative psychological and emotional impacts. To address this challenge, we introduce a novel Artificial Intelligence (AI) assistant that elicits a user's intention, assesses whether ongoing activities are in line with that intention, and provides gentle nudges when deviations occur. The system leverages a large language model to analyze screenshots, application titles, and URLs, issuing notifications when behavior diverges from the stated goal. Its detection accuracy is refined through initial clarification dialogues and continuous user feedback. In a three-week, within-subjects field deployment with 22 participants, we compared our assistant to both a rule-based intent reminder system and a passive baseline that only logged activity. Results indicate that our AI assistant effectively supports users in maintaining focus and aligning their digital behavior with their intentions. Our source code is publicly available at https://intentassistant.github.io
comment: Corrected a typo in authors' name and added acknowledgments
♻ ☆ Beat Tracking as Object Detection
Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
comment: 11 pages, 4 figures, 5 tables
♻ ☆ Incentive-Based Federated Learning: Architectural Elements and Future Directions
Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.
comment: 24 pages, 5 figures, chapter for edited book (Federated Learning: Foundations and Applications)
♻ ☆ Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
♻ ☆ CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations
Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Welcome to visit our: Project page:https://causal-verse.github.io/, Dataset:https://huggingface.co/CausalVerse.
♻ ☆ PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features
LLM-integrated applications are vulnerable to prompt injection attacks, where an attacker contaminates the input to inject malicious prompts, causing the LLM to follow the attacker's intent instead of the original user's. Existing prompt injection detection methods often have sub-optimal performance and/or high computational overhead. In this work, we propose PIShield, a detection method that is both effective and efficient. Our key observation is that the internal representation of the final token in a prompt-extracted from a specific layer of the LLM, which we term the injection-critical layer-captures distinguishing features between clean and contaminated prompts. Leveraging this insight, we train a simple linear classifier on these internal representations using a labeled set of clean and contaminated prompts. We compare PIShield against 11 baselines across 5 diverse benchmark datasets and 8 prompt injection attacks. The results demonstrate that PIShield is both highly effective and efficient, substantially outperforming existing methods. Additionally, we show that PIShield resists strong adaptive attacks.
comment: The code is available at https://github.com/weizou52/PIShield
♻ ☆ Improving Intrusion Detection with Domain-Invariant Representation Learning in Latent Space
Zero-day anomaly detection is critical in industrial applications where novel, unforeseen threats can compromise system integrity and safety. Traditional detection systems often fail to identify these unseen anomalies due to their reliance on in-distribution data. Domain generalization addresses this gap by leveraging knowledge from multiple known domains to detect out-of-distribution events. In this work, we introduce a multi-task representation learning technique that fuses information across related domains into a unified latent space. By jointly optimizing classification, reconstruction, and mutual information regularization losses, our method learns a minimal(bottleneck), domain-invariant representation that discards spurious correlations. This latent space decorrelation enhances generalization, enabling the detection of anomalies in unseen domains. Our experimental results demonstrate significant improvements in zero-day or novel anomaly detection across diverse anomaly detection datasets.
♻ ☆ FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment metric that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Evaluating 26 leading LLMs reveals that even frontier proprietary systems exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
comment: 18 pages, includes figures and tables; introduces the FinChain benchmark and ChainEval metric
♻ ☆ Optimistic Query Routing in Clustering-based Approximate Maximum Inner Product Search
Clustering-based nearest neighbor search is an effective method in which points are partitioned into geometric shards to form an index, with only a few shards searched during query processing to find a set of top-$k$ vectors. Even though the search efficacy is heavily influenced by the algorithm that identifies the shards to probe, it has received little attention in the literature. This work bridges that gap by studying routing in clustering-based maximum inner product search. We unpack existing routers and notice the surprising contribution of optimism. We then take a page from the sequential decision making literature and formalize that insight following the principle of ``optimism in the face of uncertainty.'' In particular, we present a framework that incorporates the moments of the distribution of inner products within each shard to estimate the maximum inner product. We then present an instance of our algorithm that uses only the first two moments to reach the same accuracy as state-of-the-art routers such as ScaNN by probing up to $50\%$ fewer points on benchmark datasets. Our algorithm is also space-efficient: we design a sketch of the second moment whose size is independent of the number of points and requires $\mathcal{O}(1)$ vectors per shard.
♻ ☆ MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.
♻ ☆ Learning to Interpret Weight Differences in Language Models
Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes ("weight diffs") are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT-adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.
comment: Project code and links to weight diffs, adapters, and training data can be found at https://github.com/Aviously/diff-interpretation-tuning
♻ ☆ Understanding Generalization in Node and Link Prediction
Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Besides working well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored MPNNs' generalization in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d.\@ assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework to analyze the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task under the inductive or transductive setting. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs' generalization capabilities in these tasks.
comment: arXiv admin note: text overlap with arXiv:2412.07106
♻ ☆ Light-Weight Diffusion Multiplier and Uncertainty Quantification for Fourier Neural Operators
Operator learning is a powerful paradigm for solving partial differential equations, with Fourier Neural Operators serving as a widely adopted foundation. However, FNOs face significant scalability challenges due to overparameterization and offer no native uncertainty quantification -- a key requirement for reliable scientific and engineering applications. Instead, neural operators rely on post hoc UQ methods that ignore geometric inductive biases. In this work, we introduce DINOZAUR: a diffusion-based neural operator parametrization with uncertainty quantification. Inspired by the structure of the heat kernel, DINOZAUR replaces the dense tensor multiplier in FNOs with a dimensionality-independent diffusion multiplier that has a single learnable time parameter per channel, drastically reducing parameter count and memory footprint without compromising predictive performance. By defining priors over those time parameters, we cast DINOZAUR as a Bayesian neural operator to yield spatially correlated outputs and calibrated uncertainty estimates. Our method achieves competitive or superior performance across several PDE benchmarks while providing efficient uncertainty quantification.
♻ ☆ DeepRV: Accelerating spatiotemporal inference with pre-trained neural priors
Gaussian Processes (GPs) provide a flexible and statistically principled foundation for modelling spatiotemporal phenomena, but their $O(N^3)$ scaling makes them intractable for large datasets. Approximate methods such as variational inference (VI), inducing points (sparse GPs), low-rank factorizations (RFFs), local factorizations and approximations (INLA), improve scalability but trade off accuracy or flexibility. We introduce DeepRV, a neural-network surrogate that closely matches full GP accuracy including hyperparameter estimates, while reducing computational complexity to $O(N^2)$, increasing scalability and inference speed. DeepRV serves as a drop-in replacement for GP prior realisations in e.g. MCMC-based probabilistic programming pipelines, preserving full model flexibility. Across simulated benchmarks, non-separable spatiotemporal GPs, and a real-world application to education deprivation in London (n = 4,994 locations), DeepRV achieves the highest fidelity to exact GPs while substantially accelerating inference. Code is provided in the accompanying ZIP archive, with all experiments run on a single consumer-grade GPU to ensure accessibility for practitioners.
comment: Code to reproduce all experiments is available in the ancillary file deep_rv_code, with instructions in the included README
♻ ☆ Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces \textit{false negatives} (rejecting correct answers, FNs) and \textit{false positives} (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction $\frac{12}{36}$ as wrong when compared against the canonical $\frac{1}{3}$ due to brittle parsing/equivalence rules (FN), while a large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a \textit{backward} correction that de-biases the observed binary reward to recover an \textit{unbiased} estimator of the clean policy gradient. The second is a \textit{forward} correction that reweights score-function terms so that the expected update direction aligns with the \textit{clean gradient}; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, obtaining outperformance compared with other state-of-the-art contenders.
♻ ☆ How Sparse Can We Prune A Deep Network: A Fundamental Limit Perspective
Network pruning is a commonly used measure to alleviate the storage and computational burden of deep neural networks. However, the fundamental limit of network pruning is still lacking. To close the gap, in this work we'll take a first-principles approach, i.e. we'll directly impose the sparsity constraint on the loss function and leverage the framework of statistical dimension in convex geometry, thus enabling us to characterize the sharp phase transition point, which can be regarded as the fundamental limit of the pruning ratio. Through this limit, we're able to identify two key factors that determine the pruning ratio limit, namely, weight magnitude and network sharpness. Generally speaking, the flatter the loss landscape or the smaller the weight magnitude, the smaller pruning ratio. Moreover, we provide efficient countermeasures to address the challenges in the computation of the pruning limit, which mainly involves the accurate spectrum estimation of a large-scale and non-positive Hessian matrix. Moreover, through the lens of the pruning ratio threshold, we can also provide rigorous interpretations on several heuristics in existing pruning algorithms. Extensive experiments are performed which demonstrate that our theoretical pruning ratio threshold coincides very well with the experiments. All codes are available at: https://github.com/QiaozheZhang/Global-One-shot-Pruning
♻ ☆ Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields NeurIPS 2025
Machine-learned force fields (MLFFs), especially pre-trained foundation models, are transforming computational materials science by enabling ab initio-level accuracy at molecular dynamics scales. Yet their rapid rise raises a key question: should researchers train specialist models from scratch, fine-tune generalist foundation models, or use hybrid approaches? The trade-offs in data efficiency, accuracy, cost, and robustness to out-of-distribution failure remain unclear. We introduce a benchmarking framework using defect migration pathways, evaluated through nudged elastic band trajectories, as diagnostic probes that test both interpolation and extrapolation. Using Cr-doped Sb2Te3 as a representative two-dimensional material, we benchmark multiple training paradigms within the MACE architecture across equilibrium, kinetic (atomic migration), and mechanical (interlayer sliding) tasks. Fine-tuned models substantially outperform from-scratch and zero-shot approaches for kinetic properties but show partial loss of long-range physics. Representational analysis reveals distinct, non-overlapping latent encodings, indicating that different training strategies learn different aspects of system physics. This framework provides practical guidelines for MLFF development and establishes migration-based probes as efficient diagnostics linking performance to learned representations, guiding future uncertainty-aware active learning.
comment: 10 pages (+Appendix 22 pages), 8 figures. To appear in 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: AI for Accelerated Materials Discovery (AI4Mat) Workshop. Code and datasets available at https://github.com/yicao-elina/MigrationBench.git
♻ ☆ Onboard Mission Replanning for Adaptive Cooperative Multi-Robot Systems
Cooperative autonomous robotic systems have significant potential for executing complex multi-task missions across space, air, ground, and maritime domains. But they commonly operate in remote, dynamic and hazardous environments, requiring rapid in-mission adaptation without reliance on fragile or slow communication links to centralised compute. Fast, on-board replanning algorithms are therefore needed to enhance resilience. Reinforcement Learning shows strong promise for efficiently solving mission planning tasks when formulated as Travelling Salesperson Problems (TSPs), but existing methods: 1) are unsuitable for replanning, where agents do not start at a single location; 2) do not allow cooperation between agents; 3) are unable to model tasks with variable durations; or 4) lack practical considerations for on-board deployment. Here we define the Cooperative Mission Replanning Problem as a novel variant of multiple TSP with adaptations to overcome these issues, and develop a new encoder/decoder-based model using Graph Attention Networks and Attention Models to solve it effectively and efficiently. Using a simple example of cooperative drones, we show our replanner consistently (90% of the time) maintains performance within 10% of the state-of-the-art LKH3 heuristic solver, whilst running 85-370 times faster on a Raspberry Pi. This work paves the way for increased resilience in autonomous multi-agent systems.
comment: 9 pages, 5 figures, 1 table
♻ ☆ Implicit neural representations for accurate estimation of the standard model of white matter
Diffusion magnetic resonance imaging (dMRI) enables non-invasive investigation of tissue microstructure. The Standard Model (SM) of white matter aims to disentangle dMRI signal contributions from intra- and extra-axonal water compartments. However, due to the model its high-dimensional nature, accurately estimating its parameters poses a complex problem and remains an active field of research, in which different (machine learning) strategies have been proposed. This work introduces an estimation framework based on implicit neural representations (INRs), which incorporate spatial regularization through the sinusoidal encoding of the input coordinates. The INR method is evaluated on both synthetic and in vivo datasets and compared to existing methods. Results demonstrate superior accuracy of the INR method in estimating SM parameters, particularly in low signal-to-noise conditions. Additionally, spatial upsampling of the INR can represent the underlying dataset anatomically plausibly in a continuous way. The INR is self-supervised, eliminating the need for labeled training data. It achieves fast inference, is robust to noise, supports joint estimation of SM kernel parameters and the fiber orientation distribution function with spherical harmonics orders up to at least 8, and accommodates gradient non-uniformity corrections. The combination of these properties positions INRs as a potentially important tool for analyzing and interpreting diffusion MRI data.
♻ ☆ Predicting gene essentiality and drug response from perturbation screens in preclinical cancer models with LEAP: Layered Ensemble of Autoencoders and Predictors
High-throughput preclinical perturbation screens, where the effects of genetic, chemical, or environmental perturbations are systematically tested on disease models, hold significant promise for machine learning-enhanced drug discovery due to their scale and causal nature. Predictive models trained on such datasets can be used to (i) infer perturbation response for previously untested disease models, and (ii) characterise the biological context that affects perturbation response. Existing predictive models suffer from limited reproducibility, generalisability and interpretability. To address these issues, we introduce a framework of Layered Ensemble of Autoencoders and Predictors (LEAP), a general and flexible ensemble strategy to aggregate predictions from multiple regressors trained using diverse gene expression representation models. LEAP consistently improves prediction performances in unscreened cell lines across modelling strategies. In particular, LEAP applied to perturbation-specific LASSO regressors (PS-LASSO) provides a favorable balance between near state-of-the-art performance and low computation time. We also propose an interpretability approach combining model distillation and stability selection to identify important biological pathways for perturbation response prediction in LEAP. Our models have the potential to accelerate the drug discovery pipeline by guiding the prioritisation of preclinical experiments and providing insights into the biological mechanisms involved in perturbation response. The code and datasets used in this work are publicly available.
♻ ☆ Retro3D: A 3D-aware Template-free Method for Enhancing Retrosynthesis via Molecular Conformer Information
Retrosynthesis plays a crucial role in the fields of organic synthesis and drug development, where the goal is to identify suitable reactants that can yield a target product molecule. Although existing methods have achieved notable success, they typically overlook the 3D conformational details and internal spatial organization of molecules. This oversight makes it challenging to predict reactants that conform to genuine chemical principles, particularly when dealing with complex molecular structures, such as polycyclic and heteroaromatic compounds. In response to this challenge, we introduce a novel transformer-based, template-free approach that incorporates 3D conformer data and spatial information. Our approach includes an Atom-align Fusion module that integrates 3D positional data at the input stage, ensuring correct alignment between atom tokens and their respective 3D coordinates. Additionally, we propose a Distance-weighted Attention mechanism that refines the self-attention process, constricting the model s focus to relevant atom pairs in 3D space. Extensive experiments on the USPTO-50K dataset demonstrate that our model outperforms previous template-free methods, setting a new benchmark for the field. A case study further highlights our method s ability to predict reasonable and accurate reactants.
♻ ☆ LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
The development of accurate machine learning interatomic potentials (MLIPs) is limited by the fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets derived from Density Functional Theory (DFT). These datasets are expensive to generate yet difficult to combine due to variations in format, metadata, and accessibility. To address this, we introduce LeMat-Traj, a curated dataset comprising over 120 million atomic configurations aggregated from large-scale repositories, including the Materials Project, Alexandria, and OQMD. LeMat-Traj standardizes data representation, harmonizes results and filters for high-quality configurations across widely used DFT functionals (PBE, PBESol, SCAN, r2SCAN). It significantly lowers the barrier for training transferrable and accurate MLIPs. LeMat-Traj spans both relaxed low-energy states and high-energy, high-force structures, complementing molecular dynamics and active learning datasets. By fine-tuning models pre-trained on high-force data with LeMat-Traj, we achieve a significant reduction in force prediction errors on relaxation tasks. We also present LeMaterial-Fetcher, a modular and extensible open-source library developed for this work, designed to provide a reproducible framework for the community to easily incorporate new data sources and ensure the continued evolution of large-scale materials datasets. LeMat-Traj and LeMaterial-Fetcher are publicly available at https://huggingface.co/datasets/LeMaterial/LeMat-Traj and https://github.com/LeMaterial/lematerial-fetcher.
♻ ☆ All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on said dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, (1) it is relatively easy to learn the relatively simple RM (verifier) from the preference data. Then, (2) the downstream RL procedure only returns policies (generators) that are optimal for such relatively simple verifiers. Thus, end-to-end, two-stage online FT only has to search over a reduced subset of the full space of policies, requiring less data than offline FT.
♻ ☆ Bayesian Optimization with Preference Exploration using a Monotonic Neural Network Ensemble
Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning allows to focus the search on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.
♻ ☆ Operationalizing Automated Essay Scoring: A Human-Aware Approach
This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Models (LLMs) approaches, identifying their strengths, similarities and differences. The study investigates key dimensions such as bias, robustness, and explainability, considered important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.
♻ ☆ Euclidean Distance Matrix Completion via Asymmetric Projected Gradient Descent
This paper proposes and analyzes a gradient-type algorithm based on Burer-Monteiro factorization, called the Asymmetric Projected Gradient Descent (APGD), for reconstructing the point set configuration from partial Euclidean distance measurements, known as the Euclidean Distance Matrix Completion (EDMC) problem. By paralleling the incoherence matrix completion framework, we show for the first time that global convergence guarantee with exact recovery of this routine can be established given $\mathcal{O}(\mu^2 r^3 \kappa^2 n \log n)$ Bernoulli random observations without any sample splitting. Unlike leveraging the tangent space Restricted Isometry Property (RIP) and local curvature of the low-rank embedding manifold in some very recent works, our proof provides extra upper bounds that act as analogies of the random graph lemma under EDMC setting. The APGD works surprisingly well and numerical experiments demonstrate exact linear convergence behavior in rich-sample regions yet deteriorates rapidly when compared with the performance obtained by optimizing the s-stress function, i.e., the standard but unexplained non-convex approach for EDMC, if the sample size is limited. While virtually matching our theoretical prediction, this unusual phenomenon might indicate that: (i) the power of implicit regularization is weakened when specified in the APGD case; (ii) the stabilization of such new gradient direction requires substantially more samples than the information-theoretic limit would suggest.
♻ ☆ Theory of periodic convolutional neural network
We introduce a novel convolutional neural network architecture, termed the \emph{periodic CNN}, which incorporates periodic boundary conditions into the convolutional layers. Our main theoretical contribution is a rigorous approximation theorem: periodic CNNs can approximate ridge functions depending on $d-1$ linear variables in a $d$-dimensional input space, while such approximation is impossible in lower-dimensional ridge settings ($d-2$ or fewer variables). This result establishes a sharp characterization of the expressive power of periodic CNNs. Beyond the theory, our findings suggest that periodic CNNs are particularly well-suited for problems where data naturally admits a ridge-like structure of high intrinsic dimension, such as image analysis on wrapped domains, physics-informed learning, and materials science. The work thus both expands the mathematical foundation of CNN approximation theory and highlights a class of architectures with surprising and practically relevant approximation capabilities.
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t\!\to\!0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.
♻ ☆ Flow Matching for Robust Simulation-Based Inference under Model Misspecification
Simulation-based inference (SBI) is transforming experimental sciences by enabling parameter estimation in complex non-linear models from simulated data. A persistent challenge, however, is model misspecification: simulators are only approximations of reality, and mismatches between simulated and real data can yield biased or overconfident posteriors. We address this issue by introducing Flow Matching Corrected Posterior Estimation (FMCPE), a framework that leverages the flow matching paradigm to refine simulation-trained posterior estimators using a small set of real calibration samples. Our approach proceeds in two stages: first, a posterior approximator is trained on abundant simulated data; second, flow matching transports its predictions toward the true posterior supported by real observations, without requiring explicit knowledge of the misspecification. This design enables FMCPE to combine the scalability of SBI with robustness to distributional shift. Across synthetic benchmarks and real-world datasets, we show that our proposal consistently mitigates the effects of misspecification, delivering improved inference accuracy and uncertainty calibration compared to standard SBI baselines, while remaining computationally efficient.
♻ ☆ Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
Deep reinforcement learning (DRL) agents excel in solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that leverages the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. Extensive experiments across the Atari 2600 and MuJoCo domains on various agents, such as DQN and SAC, achieve up to 383% higher reward ratio, outperforming the same agents without our proposed term, and reducing the experience replay buffer size by up to 96%, significantly improving sample efficiency at a negligible cost.
comment: 57 pages, 17 figures
♻ ☆ CoUn: Empowering Machine Unlearning via Contrastive Learning
Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.
♻ ☆ Neural Diffusion Processes for Physically Interpretable Survival Prediction
We introduce DeepFHT, a survival-analysis framework that couples deep neural networks with first hitting time (FHT) distributions from stochastic process theory. Time to event is represented as the first passage of a latent diffusion process to an absorbing boundary. A neural network maps input variables to physically meaningful parameters including initial condition, drift, and diffusion, within a chosen FHT process such as Brownian motion, both with drift and driftless. This yields closed- form survival and hazard functions and captures time-varying risk without assuming proportional- hazards. We compare DeepFHT with Cox regression using synthetic and real-world datasets. The method achieves predictive accuracy on par with the state-of-the-art approach, while maintaining a physics- based interpretable parameterization that elucidates the relation between input features and risk. This combination of stochastic process theory and deep learning provides a principled avenue for modeling survival phenomena in complex systems
comment: 12 pages, 5 figures
♻ ☆ RadioDiff-$k^2$: Helmholtz Equation Informed Generative Diffusion Model for Multi-Path Aware Radio Map Construction
In this paper, we propose a novel physics-informed generative learning approach, named RadioDiff-$k^2$, for accurate and efficient multipath-aware radio map (RM) construction. As future wireless communication evolves towards environment-aware paradigms, the accurate construction of RMs becomes crucial yet highly challenging. Conventional electromagnetic (EM)-based methods, such as full-wave solvers and ray-tracing approaches, exhibit substantial computational overhead and limited adaptability to dynamic scenarios. Although existing neural network (NN) approaches have efficient inferencing speed, they lack sufficient consideration of the underlying physics of EM wave propagation, limiting their effectiveness in accurately modeling critical EM singularities induced by complex multipath environments. To address these fundamental limitations, we propose a novel physics-inspired RM construction method guided explicitly by the Helmholtz equation, which inherently governs EM wave propagation. Specifically, based on the analysis of partial differential equations (PDEs), we theoretically establish a direct correspondence between EM singularities, which correspond to the critical spatial features influencing wireless propagation, and regions defined by negative wave numbers in the Helmholtz equation. We then design an innovative dual diffusion model (DM)-based large artificial intelligence framework comprising one DM dedicated to accurately inferring EM singularities and another DM responsible for reconstructing the complete RM using these singularities along with environmental contextual information. Experimental results demonstrate that the proposed RadioDiff-$k^2$ framework achieves state-of-the-art (SOTA) performance in both image-level RM construction and localization tasks, while maintaining inference latency within a few hundred milliseconds.
♻ ☆ Reasoning-Enhanced Large Language Models for Molecular Property Prediction
Molecular property prediction is crucial for drug discovery and materials science, yet existing approaches suffer from limited interpretability, poor cross-task generalization, and lack of chemical reasoning capabilities. Traditional machine learning models struggle with task transferability, while specialized molecular language models provide little insight into their decision-making processes. To address these limitations, we propose \textbf{MPPReasoner}, a multimodal large language model that incorporates chemical reasoning for molecular property prediction. Our approach, built upon Qwen2.5-VL-7B-Instruct, integrates molecular images with SMILES strings to enable comprehensive molecular understanding. We develop a two-stage training strategy: supervised fine-tuning (SFT) using 16,000 high-quality reasoning trajectories generated through expert knowledge and multiple teacher models, followed by Reinforcement Learning from Principle-Guided Rewards (RLPGR). RLPGR employs verifiable, rule-based rewards that systematically evaluate chemical principle application, molecular structure analysis, and logical consistency through computational verification. Extensive experiments across 8 datasets demonstrate significant performance improvements, with MPPReasoner outperforming the best baselines by 7.91\% and 4.53\% on in-distribution and out-of-distribution tasks respectively. MPPReasoner exhibits exceptional cross-task generalization and generates chemically sound reasoning paths that provide valuable insights into molecular property analysis, substantially enhancing both interpretability and practical utility for chemists. Code is available at https://anonymous.4open.science/r/MPPReasoner-12687.
♻ ☆ Expanding the Action Space of LLMs to Reason Beyond Language
Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
♻ ☆ msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML NeurIPS 2025
AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU's tiny memory budget e.g., 128kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.
comment: NeurIPS 2025 (poster)
♻ ☆ Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning NeurIPS2025
We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.
comment: NeurIPS2025
♻ ☆ Lookup multivariate Kolmogorov-Arnold Networks
High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce a general-purpose drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0x while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10x higher H100 throughput at equal accuracy. Within frameworks of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6-2.1x and by 1.7x on the CIFAR-10 and ImageNet-1k datasets, respectively. Our code, including dedicated CUDA kernels, is available online at https://github.com/schwallergroup/lmkan.
comment: polishing
♻ ☆ Traces Propagation: Memory-Efficient and Scalable Forward-Only Learning in Spiking Neural Networks
Spiking Neural Networks (SNNs) provide an efficient framework for processing dynamic spatio-temporal signals and for investigating the learning principles underlying biological neural systems. A key challenge in training SNNs is to solve both spatial and temporal credit assignment. The dominant approach for training SNNs is Backpropagation Through Time (BPTT) with surrogate gradients. However, BPTT is in stark contrast with the spatial and temporal locality observed in biological neural systems and leads to high computational and memory demands, limiting efficient training strategies and on-device learning. Although existing local learning rules achieve local temporal credit assignment by leveraging eligibility traces, they fail to address the spatial credit assignment without resorting to auxiliary layer-wise matrices, which increase memory overhead and hinder scalability, especially on embedded devices. In this work, we propose Traces Propagation (TP), a forward-only, memory-efficient, scalable, and fully local learning rule that combines eligibility traces with a layer-wise contrastive loss without requiring auxiliary layer-wise matrices. TP outperforms other fully local learning rules on NMNIST and SHD datasets. On more complex datasets such as DVS-GESTURE and DVS-CIFAR10, TP showcases competitive performance and scales effectively to deeper SNN architectures such as VGG-9, while providing favorable memory scaling compared to prior fully local scalable rules, for datasets with a significant number of classes. Finally, we show that TP is well suited for practical fine-tuning tasks, such as keyword spotting on the Google Speech Commands dataset, thus paving the way for efficient learning at the edge.
♻ ☆ Finetuning and Quantization of EEG-Based Foundational BioSignal Models on ECG and PPG Data for Blood Pressure Estimation
Blood pressure (BP) is a key indicator of cardiovascular health. As hypertension remains a global cause of morbidity and mortality, accurate, continuous, and non-invasive BP monitoring is therefore of paramount importance. Photoplethysmography (PPG) and electrocardiography (ECG) can potentially enable continuous BP monitoring, yet training accurate and robust machine learning (ML) models remains challenging due to variability in data quality and patient-specific factors. Recently, multiple research groups explored Electroencephalographic (EEG)--based foundation models and demonstrated their exceptional ability to learn rich temporal resolution. Considering the morphological similarities between different biosignals, the question arises of whether a model pre-trained on one modality can effectively be exploited to improve the accuracy of a different signal type. In this work, we take an initial step towards generalized biosignal foundation models by investigating whether model representations learned from abundant EEG data can effectively be transferred to ECG/PPG data solely with fine-tuning, without the need for large-scale additional pre-training, for the BP estimation task. Evaluations on the MIMIC-III and VitalDB datasets demonstrate that our approach achieves near state-of-the-art accuracy for diastolic BP (mean absolute error of 1.57 mmHg) and surpasses by 1.5x the accuracy of prior works for systolic BP (mean absolute error 2.72 mmHg). Additionally, we perform dynamic INT8 quantization, reducing the smallest model size by over 3.5x (from 13.73 MB down to 3.83 MB) while preserving performance, thereby enabling unobtrusive, real-time BP monitoring on resource-constrained wearable devices.
comment: 7 pages, 1 figure, 5 tables, Camera-ready, EMBC 2025
♻ ☆ FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model
Accurate and efficient electroencephalography (EEG) analysis is essential for detecting seizures and artifacts in long-term monitoring, with applications spanning hospital diagnostics to wearable health devices. Robust EEG analytics have the potential to greatly improve patient care. However, traditional deep learning models, especially Transformer-based architectures, are hindered by their quadratic time and memory complexity, making them less suitable for resource-constrained environments. To address these challenges, we present FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel self-supervised framework that establishes new efficiency benchmarks for EEG analysis through bidirectional state-space modeling. Unlike Transformer-based models, which incur quadratic time and memory complexity, FEMBA scales linearly with sequence length, enabling more scalable and efficient processing of extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and fine-tuned on three downstream tasks, FEMBA achieves competitive performance in comparison with transformer models, with significantly lower computational cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates viability for resource-constrained devices. These results pave the way for scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as a promising candidate for wearable applications.
comment: 7 pages, 3 figures, 5 tables, Camera-ready version, EMBC 2025
♻ ☆ High-Probability Analysis of Online and Federated Zero-Order Optimisation
We study distributed learning in the context of gradient-free zero-order optimisation and introduce FedZero, a federated zero-order algorithm with sharp theoretical guarantees. Our contributions are threefold. First, in the federated convex setting, we derive high-probability guarantees for regret minimisation achieved by FedZero. Second, in the single-worker regime, corresponding to the classical zero-order framework with two-point feedback, we establish the first high-probability convergence guarantees for convex zero-order optimisation, strengthening previous results that held only in expectation. Third, to establish these guarantees, we develop novel concentration tools: (i) concentration inequalities with explicit constants for Lipschitz functions under the uniform measure on the $\ell_1$-sphere, and (ii) a time-uniform concentration inequality for squared sub-Gamma random variables. These probabilistic results underpin our high-probability guarantees and may also be of independent interest.
♻ ☆ FinHEAR: Human Expertise and Adaptive Risk-Aware Temporal Reasoning for Financial Decision-Making
Financial decision-making presents unique challenges for language models, demanding temporal reasoning, adaptive risk assessment, and responsiveness to dynamic events. While large language models (LLMs) show strong general reasoning capabilities, they often fail to capture behavioral patterns central to human financial decisions-such as expert reliance under information asymmetry, loss-averse sensitivity, and feedback-driven temporal adjustment. We propose FinHEAR, a multi-agent framework for Human Expertise and Adaptive Risk-aware reasoning. FinHEAR orchestrates specialized LLM-based agents to analyze historical trends, interpret current events, and retrieve expert-informed precedents within an event-centric pipeline. Grounded in behavioral economics, it incorporates expert-guided retrieval, confidence-adjusted position sizing, and outcome-based refinement to enhance interpretability and robustness. Empirical results on curated financial datasets show that FinHEAR consistently outperforms strong baselines across trend prediction and trading tasks, achieving higher accuracy and better risk-adjusted returns.
♻ ☆ Meta-learning of Gibbs states for many-body Hamiltonians with applications to Quantum Boltzmann Machines
The preparation of quantum Gibbs states is a fundamental challenge in quantum computing, essential for applications ranging from modeling open quantum systems to quantum machine learning. Building on the Meta-Variational Quantum Eigensolver framework proposed by Cervera-Lierta et al.(2021) and a problem driven ansatz design, we introduce two meta-learning algorithms: Meta-Variational Quantum Thermalizer (Meta-VQT) and Neural Network Meta-VQT (NN-Meta VQT) for efficient thermal state preparation of parametrized Hamiltonians on Noisy Intermediate-Scale Quantum (NISQ) devices. Meta-VQT utilizes a fully quantum ansatz, while NN Meta-VQT integrates a quantum classical hybrid architecture. Both leverage collective optimization over training sets to generalize Gibbs state preparation to unseen parameters. We validate our methods on upto 8-qubit Transverse Field Ising Model and the 2-qubit Heisenberg model with all field terms, demonstrating efficient thermal state generation beyond training data. For larger systems, we show that our meta-learned parameters when combined with appropriately designed ansatz serve as warm start initializations, significantly outperforming random initializations in the optimization tasks. Furthermore, a 3- qubit Kitaev ring example showcases our algorithm's effectiveness across finite-temperature crossover regimes. Finally, we apply our algorithms to train a Quantum Boltzmann Machine (QBM) on a 2-qubit Heisenberg model with all field terms, achieving enhanced training efficiency, improved Gibbs state accuracy, and a 30-fold runtime speedup over existing techniques such as variational quantum imaginary time (VarQITE)-based QBM highlighting the scalability and practicality of meta-algorithm-based QBMs.
comment: 20 pages, 14 figures, 3 tables, 3 algorithms
♻ ☆ Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing
Large language models (LLMs) fine-tuning shows excellent implications. However, vanilla fine-tuning methods often require intricate data mixture and repeated experiments for optimal generalization. To address these challenges and streamline the training process, we propose an efficient and universal solution, Dynamic Boosted Annealing (DBA). We obtain a global gradient through zero-learning-rate training on general data, which is subsequently employed for gradient boosting and dynamic training step correction during domain training. In conjunction with annealing learning, we end up establishing a fine-tuning pipeline that relies solely on domain data without collapse. By evaluating both general and domain-specific performance across multiple tasks on several popular base models, DBA achieves an average improvement of 5.8% in joint performance over vanilla fine-tuning. Furthermore, since general data is no longer involved in annealing, repeated experiments led by data mixture are also eliminated. According to our tests, the DBA method can reduce GPU hours by 91.0% compared to the vanilla method.
comment: 9 pages, 5 figures
♻ ☆ When In Doubt, Abstain: The Impact of Abstention on Strategic Classification
Algorithmic decision making is increasingly prevalent, but often vulnerable to strategic manipulation by agents seeking a favorable outcome. Prior research has shown that classifier abstention (allowing a classifier to decline making a decision due to insufficient confidence) can significantly increase classifier accuracy. This paper studies abstention within a strategic classification context, exploring how its introduction impacts strategic agents' responses and how principals should optimally leverage it. We model this interaction as a Stackelberg game where a principal, acting as the classifier, first announces its decision policy, and then strategic agents, acting as followers, manipulate their features to receive a desired outcome. Here, we focus on binary classifiers where agents manipulate observable features rather than their true features, and show that optimal abstention ensures that the principal's utility (or loss) is no worse than in a non-abstention setting, even in the presence of strategic agents. We also show that beyond improving accuracy, abstention can also serve as a deterrent to manipulation, making it costlier for agents, especially those less qualified, to manipulate to achieve a positive outcome when manipulation costs are significant enough to affect agent behavior. These results highlight abstention as a valuable tool for reducing the negative effects of strategic behavior in algorithmic decision making systems.
♻ ☆ When Does Closeness in Distribution Imply Representational Similarity? An Identifiability Perspective
When and why representations learned by different deep neural networks are similar is an active research topic. We choose to address these questions from the perspective of identifiability theory, which suggests that a measure of representational similarity should be invariant to transformations that leave the model distribution unchanged. Focusing on a model family which includes several popular pre-training approaches, e.g., autoregressive language models, we explore when models which generate distributions that are close have similar representations. We prove that a small Kullback--Leibler divergence between the model distributions does not guarantee that the corresponding representations are similar. This has the important corollary that models with near-maximum data likelihood can still learn dissimilar representations -- a phenomenon mirrored in our experiments with models trained on CIFAR-10. We then define a distributional distance for which closeness implies representational similarity, and in synthetic experiments, we find that wider networks learn distributions which are closer with respect to our distance and have more similar representations. Our results thus clarify the link between closeness in distribution and representational similarity.
♻ ☆ Skull-stripping induces shortcut learning in MRI-based Alzheimer's disease classification
Objectives: High classification accuracy of Alzheimer's disease (AD) from structural MRI has been achieved using deep neural networks, yet the specific image features contributing to these decisions remain unclear. In this study, the contributions of T1-weighted (T1w) gray-white matter texture, volumetric information, and preprocessing -- particularly skull-stripping -- were systematically assessed. Methods: A dataset of 990 matched T1w MRIs from AD patients and cognitively normal controls from the ADNI database were used. Preprocessing was varied through skull-stripping and intensity binarization to isolate texture and shape contributions. A 3D convolutional neural network was trained on each configuration, and classification performance was compared using exact McNemar tests with discrete Bonferroni-Holm correction. Feature relevance was analyzed using Layer-wise Relevance Propagation, image similarity metrics, and spectral clustering of relevance maps. Results: Despite substantial differences in image content, classification accuracy, sensitivity, and specificity remained stable across preprocessing conditions. Models trained on binarized images preserved performance, indicating minimal reliance on gray-white matter texture. Instead, volumetric features -- particularly brain contours introduced through skull-stripping -- were consistently used by the models. Conclusions: This behavior reflects a shortcut learning phenomenon, where preprocessing artifacts act as potentially unintended cues. The resulting Clever Hans effect emphasizes the critical importance of interpretability tools to reveal hidden biases and to ensure robust and trustworthy deep learning in medical imaging.
♻ ☆ Ascent Fails to Forget NeurIPS 2025
Contrary to common belief, we show that gradient ascent-based unconstrained optimization methods frequently fail to perform machine unlearning, a phenomenon we attribute to the inherent statistical dependence between the forget and retain data sets. This dependence, which can manifest itself even as simple correlations, undermines the misconception that these sets can be independently manipulated during unlearning. We provide empirical and theoretical evidence showing these methods often fail precisely due to this overlooked relationship. For random forget sets, this dependence means that degrading forget set metrics (which, for a retrained model, should mirror test set metrics) inevitably harms overall test performance. Going beyond random sets, we consider logistic regression as an instructive example where a critical failure mode emerges: inter-set dependence causes gradient descent-ascent iterations to progressively diverge from the ideal retrained model. Strikingly, these methods can converge to solutions that are not only far from the retrained ideal but are potentially even further from it than the original model itself, rendering the unlearning process actively detrimental. A toy example further illustrates how this dependence can trap models in inferior local minima, inescapable via finetuning. Our findings highlight that the presence of such statistical dependencies, even when manifest only as correlations, can be sufficient for ascent-based unlearning to fail. Our theoretical insights are corroborated by experiments on complex neural networks, demonstrating that these methods do not perform as expected in practice due to this unaddressed statistical interplay.
comment: NeurIPS 2025
♻ ☆ MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark
Recent advances in generative models have led to their application in password guessing, with the aim of replicating the complexity, structure, and patterns of human-created passwords. Despite their potential, inconsistencies and inadequate evaluation methodologies in prior research have hindered meaningful comparisons and a comprehensive, unbiased understanding of their capabilities. This paper introduces MAYA, a unified, customizable, plug-and-play benchmarking framework designed to facilitate the systematic characterization and benchmarking of generative password-guessing models in the context of trawling attacks. Using MAYA, we conduct a comprehensive assessment of six state-of-the-art approaches, which we re-implemented and adapted to ensure standardization. Our evaluation spans eight real-world password datasets and covers an exhaustive set of advanced testing scenarios, totaling over 15,000 compute hours. Our findings indicate that these models effectively capture different aspects of human password distribution and exhibit strong generalization capabilities. However, their effectiveness varies significantly with long and complex passwords. Through our evaluation, sequential models consistently outperform other generative architectures and traditional password-guessing tools, demonstrating unique capabilities in generating accurate and complex guesses. Moreover, the diverse password distributions learned by the models enable a multi-model attack that outperforms the best individual model. By releasing MAYA, we aim to foster further research, providing the community with a new tool to consistently and reliably benchmark generative password-guessing models. Our framework is publicly available at https://github.com/williamcorrias/MAYA-Password-Benchmarking.
comment: Paper accepted at the 47th IEEE Symposium on Security and Privacy (S&P 2026)
♻ ☆ Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
comment: 21 pages
♻ ☆ Rethinking Robustness in Machine Learning: A Posterior Agreement Approach
The robustness of algorithms against covariate shifts is a fundamental problem with critical implications for the deployment of machine learning algorithms in the real world. Current evaluation methods predominantly measure robustness through the lens of standard generalization, relying on task performance measures like accuracy. This approach lacks a theoretical justification and underscores the need for a principled foundation of robustness assessment under distribution shifts. In this work, we set the desiderata for a robustness measure, and we propose a novel principled framework for the robustness assessment problem that directly follows the Posterior Agreement (PA) theory of model validation. Specifically, we extend the PA framework to the covariate shift setting and propose a measure for robustness evaluation. We assess the soundness of our measure in controlled environments and through an empirical robustness analysis in two different covariate shift scenarios: adversarial learning and domain generalization. We illustrate the suitability of PA by evaluating several models under different nature and magnitudes of shift, and proportion of affected observations. The results show that PA offers a reliable analysis of the vulnerabilities in learning algorithms across different shift conditions and provides higher discriminability than accuracy-based measures, while requiring no supervision.
comment: Preprint submitted to TMLR. 37 pages, 18 figures
♻ ☆ The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives -- both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely -- lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
comment: 25 pages, 6 figures
♻ ☆ Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses
The Tsetlin Machine (TM) architecture has recently demonstrated effectiveness in Machine Learning (ML), particularly within Natural Language Processing (NLP). It has been utilized to construct word embedding using conjunctive propositional clauses, thereby significantly enhancing our understanding and interpretation of machine-derived decisions. The previous approach performed the word embedding over a sequence of input words to consolidate the information into a cohesive and unified representation. However, that approach encounters scalability challenges as the input size increases. In this study, we introduce a novel approach incorporating two-phase training to discover contextual embeddings of input sequences. Specifically, this method encapsulates the knowledge for each input word within the dataset's vocabulary, subsequently constructing embeddings for a sequence of input words utilizing the extracted knowledge. This technique not only facilitates the design of a scalable model but also preserves interpretability. Our experimental findings revealed that the proposed method yields competitive performance compared to the previous approaches, demonstrating promising results in contrast to human-generated benchmarks. Furthermore, we applied the proposed approach to sentiment analysis on the IMDB dataset, where the TM embedding and the TM classifier, along with other interpretable classifiers, offered a transparent end-to-end solution with competitive performance.
♻ ☆ VLMLight: Safety-Critical Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning Architecture
Traffic signal control (TSC) is a core challenge in urban mobility, where real-time decisions must balance efficiency and safety. Existing methods - ranging from rule-based heuristics to reinforcement learning (RL) - often struggle to generalize to complex, dynamic, and safety-critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision-language meta-control with dual-branch reasoning. At the core of VLMLight is the first image-based traffic simulator that enables multi-view visual perception at intersections, allowing policies to reason over rich cues such as vehicle type, motion, and spatial density. A large language model (LLM) serves as a safety-prioritized meta-controller, selecting between a fast RL policy for routine traffic and a structured reasoning branch for critical cases. In the latter, multiple LLM agents collaborate to assess traffic phases, prioritize emergency vehicles, and verify rule compliance. Experiments show that VLMLight reduces waiting times for emergency vehicles by up to 65% over RL-only systems, while preserving real-time performance in standard conditions with less than 1% degradation. VLMLight offers a scalable, interpretable, and safety-aware solution for next-generation traffic signal control.
comment: 25 pages, 15 figures
♻ ☆ Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose an alternative, GPU native bilinear operator to MatMuls in neural networks, which offers a three-way tradeoff between: speed, accuracy and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), yet increases the parameter count compared to MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied on tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random SGD initialization. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve Imagenet-1K accuracy of SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounds, suggest STL as a promising building block for scalable and cost-efficient AI.
♻ ☆ Tequila: Trapping-free Ternary Quantization for Large Language Models
Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making it not feasible. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within <1% gap) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim.
♻ ☆ FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
♻ ☆ Interpretable Hybrid-Rule Temporal Point Processes
Temporal Point Processes (TPPs) are widely used for modeling event sequences in various medical domains, such as disease onset prediction, progression analysis, and clinical decision support. Although TPPs effectively capture temporal dynamics, their lack of interpretability remains a critical challenge. Recent advancements have introduced interpretable TPPs. However, these methods fail to incorporate numerical features, thereby limiting their ability to generate precise predictions. To address this issue, we propose Hybrid-Rule Temporal Point Processes (HRTPP), a novel framework that integrates temporal logic rules with numerical features, improving both interpretability and predictive accuracy in event modeling. HRTPP comprises three key components: basic intensity for intrinsic event likelihood, rule-based intensity for structured temporal dependencies, and numerical feature intensity for dynamic probability modulation. To effectively discover valid rules, we introduce a two-phase rule mining strategy with Bayesian optimization. To evaluate our method, we establish a multi-criteria assessment framework, incorporating rule validity, model fitting, and temporal predictive accuracy. Experimental results on real-world medical datasets demonstrate that HRTPP outperforms state-of-the-art interpretable TPPs in terms of predictive performance and clinical interpretability. In case studies, the rules extracted by HRTPP explain the disease progression, offering valuable contributions to medical diagnosis.
♻ ☆ Landmark-Based Node Representations for Shortest Path Distance Approximations in Random Graphs
Learning node representations is a fundamental problem in graph machine learning. While existing embedding methods effectively preserve local similarity measures, they often fail to capture global functions like graph distances. Inspired by Bourgain's seminal work on Hilbert space embeddings of metric spaces (1985), we study the performance of local distance-preserving node embeddings. Known as landmark-based algorithms, these embeddings approximate pairwise distances by computing shortest paths from a small subset of reference nodes called landmarks. Our main theoretical contribution shows that random graphs, such as Erdos-Renyi random graphs, require lower dimensions in landmark-based embeddings compared to worst-case graphs. Empirically, we demonstrate that the GNN-based approximations for the distances to landmarks generalize well to larger real-world networks, offering a scalable and transferable alternative for graph representation learning.
comment: Landmark-based embeddings preserve global distances in graphs more efficiently; on Erdos-Renyi random graphs they need lower dimensions, and GNNs generalize these embeddings to large real-world networks
♻ ☆ Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
comment: The code and model are available at https://github.com/M-E-AGI-Lab/Muddit
♻ ☆ Scaling Physical Reasoning with the PHYSICS Dataset NeurIPS
Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to facilitate this issue. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model's physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics. The code and data can be found at: https://github.com/Zhengsh123/PHYSICS.
comment: Accepted to the NeurIPS Datasets and Benchmarks Track
♻ ☆ Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce high-resolution maps that support decision-making in long-term water management.
♻ ☆ Mind the Gap: Navigating Inference with Optimal Transport Maps
Machine learning (ML) techniques have recently enabled enormous gains in sensitivity to new phenomena across the sciences. In particle physics, much of this progress has relied on excellent simulations of a wide range of physical processes. However, due to the sophistication of modern machine learning algorithms and their reliance on high-quality training samples, discrepancies between simulation and experimental data can significantly limit their effectiveness. In this work, we present a solution to this ``misspecification'' problem: a model calibration approach based on optimal transport, which we apply to high-dimensional simulations for the first time. We demonstrate the performance of our approach through jet tagging, using a dataset inspired by the CMS experiment at the Large Hadron Collider. A 128-dimensional internal jet representation from a powerful general-purpose classifier is studied; after calibrating this internal ``latent'' representation, we find that a wide variety of quantities derived from it for downstream tasks are also properly calibrated: using this calibrated high-dimensional representation, powerful new applications of jet flavor information can be utilized in LHC analyses. This is a key step toward allowing the unbiased use of ``foundation models'' in particle physics. More broadly, this calibration framework has broad applications for correcting high-dimensional simulations across the sciences.
comment: 31 pages, 13 figures
♻ ☆ GuardReasoner: Towards Reasoning-based LLM Safeguards
As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner : https://github.com/yueliu1999/GuardReasoner/.
comment: 22 pages, 18 figures
♻ ☆ A Cycle-Consistency Constrained Framework for Dynamic Solution Space Reduction in Noninjective Regression
To address the challenges posed by the heavy reliance of multi-output models on preset probability distributions and embedded prior knowledge in non-injective regression tasks, this paper proposes a cycle consistency-based data-driven training framework. The method jointly optimizes a forward model {\Phi}: X to Y and a backward model {\Psi}: Y to X, where the cycle consistency loss is defined as L _cycleb equal L(Y reduce {\Phi}({\Psi}(Y))) (and vice versa). By minimizing this loss, the framework establishes a closed-loop mechanism integrating generation and validation phases, eliminating the need for manual rule design or prior distribution assumptions. Experiments on normalized synthetic and simulated datasets demonstrate that the proposed method achieves a cycle reconstruction error below 0.003, achieving an improvement of approximately 30% in evaluation metrics compared to baseline models without cycle consistency. Furthermore, the framework supports unsupervised learning and significantly reduces reliance on manual intervention, demonstrating potential advantages in non-injective regression tasks.
♻ ☆ Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
Mean-field game theory relies on approximating games that are intractible to model due to a very large to infinite population of players. While these kinds of games can be solved analytically via the associated system of partial derivatives, this approach is not model-free, can lead to the loss of the existence or uniqueness of solutions, and may suffer from modelling bias. To reduce the dependency between the model and the game, we introduce neural mean-field games: a combination of mean-field game theory and deep learning in the form of neural stochastic differential equations. The resulting model is data-driven, lightweight, and can learn extensive strategic interactions that are hard to capture using mean-field theory alone. In addition, the model is based on automatic differentiation, making it more robust and objective than approaches based on finite differences. We highlight the efficiency and flexibility of our approach by solving two mean-field games that vary in their complexity, observability, and the presence of noise. Lastly, we illustrate the model's robustness by simulating viral dynamics based on real-world data. Here, we demonstrate that the model's ability to learn from real-world data helps to accurately model the evolution of an epidemic outbreak. Using these results, we show that the model is flexible, generalizable, and requires few observations to learn the distribution underlying the data.
♻ ☆ Learning Unified Representations from Heterogeneous Data for Robust Heart Rate Modeling
Heart rate prediction is vital for personalized health monitoring and fitness, while it frequently faces a critical challenge when deploying in real-world: data heterogeneity. We classify it in two key dimensions: source heterogeneity from fragmented device markets with varying feature sets, and user heterogeneity reflecting distinct physiological patterns across individuals and activities. Existing methods either discard device-specific information, or fail to model user-specific differences, limiting their real-world performance. To address this, we propose a framework that learns latent representations agnostic to both heterogeneity, enabling downstream predictors to work consistently under heterogeneous data patterns. Specifically, we introduce a random feature dropout strategy to handle source heterogeneity, making the model robust to various feature sets. To manage user heterogeneity, we employ a time-aware attention module to capture long-term physiological traits and use a contrastive learning objective to build a discriminative representation space. To reflect the heterogeneous nature of real-world data, we created and publicly released a new benchmark dataset, ParroTao. Evaluations on both ParroTao and the public FitRec dataset show that our model significantly outperforms existing baselines by 17% and 15%, respectively. Furthermore, analysis of the learned representations demonstrates their strong discriminative power, and one downstream application task confirm the practical value of our model.
♻ ☆ Soft Graph Transformer for MIMO Detection
We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.
comment: 5 pages with 3 figures and 2 tables, submitted to IEEE for a possible publication
♻ ☆ SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.
comment: For video demonstrations and additional materials please see https://nlml.github.io/sheap/
♻ ☆ Thinking Augmented Pre-training
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
comment: 19 pages; v4 fixes an issue for HumanEval scores
♻ ☆ A Multimodal Lightweight Approach to Fault Diagnosis of Induction Motors in High-Dimensional Dataset
An accurate AI-based diagnostic system for induction motors (IMs) holds the potential to enhance proactive maintenance, mitigating unplanned downtime and curbing overall maintenance costs within an industrial environment. Notably, among the prevalent faults in IMs, a Broken Rotor Bar (BRB) fault is frequently encountered. Researchers have proposed various fault diagnosis approaches using signal processing (SP), machine learning (ML), deep learning (DL), and hybrid architectures for BRB faults. One limitation in the existing literature is the training of these architectures on relatively small datasets, risking overfitting when implementing such systems in industrial environments. This paper addresses this limitation by implementing large-scale data of BRB faults by using a transfer-learning-based lightweight DL model named ShuffleNetV2 for diagnosing one, two, three, and four BRB faults using current and vibration signal data. Spectral images for training and testing are generated using a Short-Time Fourier Transform (STFT). The dataset comprises 57,500 images, with 47,500 used for training and 10,000 for testing. Remarkably, the ShuffleNetV2 model exhibited superior performance, in less computational cost as well as accurately classifying 98.856% of spectral images. To further enhance the visualization of harmonic sidebands resulting from broken bars, Fast Fourier Transform (FFT) is applied to current and vibration data. The paper also provides insights into the training and testing times for each model, contributing to a comprehensive understanding of the proposed fault diagnosis methodology. The findings of our research provide valuable insights into the performance and efficiency of different ML and DL models, offering a foundation for the development of robust fault diagnosis systems for induction motors in industrial settings.
♻ ☆ F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning NeurIPS 2025
Parameter-efficient fine-tuning (PEFT) of powerful pre-trained models for complex downstream tasks has proven effective in vision and language processing, yet this paradigm remains unexplored in scientific machine learning, where the objective is to model complex physical systems. We conduct the first systematic study of PEFT for pre-trained Large Operator Models (LOMs) obtained by scaling variants of Fourier Neural Operator. First, we observe that the widely used Low-Rank Adaptation (LoRA) yields markedly poorer performance on LOMs than Adapter tuning. Then, we further theoretically establish that stacked LoRA incurs a depth-amplified lower bound on approximation error within Fourier layers, whereas adapters retain universal approximation capacity and, by concentrating parameters on energy-dominant low-frequency modes, attain exponentially decaying error with bottleneck width in the Fourier domain. Motivated by the robust empirical gains of adapters and by our theoretical characterization of PDE solutions as spectrally sparse, we introduce Frequency-Adaptive Adapter (F-Adapter). F-Adapter allocates adapter capacity based on spectral complexity, assigning higher-dimension modules to low-frequency components and lower-dimension modules to high-frequency components. Our F-Adapters establish state-of-the-art (SOTA) results on multiple challenging 3D Navier-Stokes benchmarks, markedly enhancing both generalization and spectral fidelity over LoRA and other PEFT techniques commonly used in LLMs. To the best of our knowledge, this work is the first to explore PEFT for scientific machine-learning and establishes F-Adapter as an effective paradigm for this domain.
comment: NeurIPS 2025 Main Track
♻ ☆ Which exceptional low-dimensional projections of a Gaussian point cloud can be found in polynomial time?
Given $d$-dimensional standard Gaussian vectors $\boldsymbol{x}_1,\dots, \boldsymbol{x}_n$, we consider the set of all empirical distributions of its $m$-dimensional projections, for $m$ a fixed constant. Diaconis and Freedman (1984) proved that, if $n/d\to \infty$, all such distributions converge to the standard Gaussian distribution. In contrast, we study the proportional asymptotics, whereby $n,d\to \infty$ with $n/d\to \alpha \in (0, \infty)$. In this case, the projection of the data points along a typical random subspace is again Gaussian, but the set $\mathscr{F}_{m,\alpha}$ of all probability distributions that are asymptotically feasible as $m$-dimensional projections contains non-Gaussian distributions corresponding to exceptional subspaces. Non-rigorous methods from statistical physics yield an indirect characterization of $\mathscr{F}_{m,\alpha}$ in terms of a generalized Parisi formula. Motivated by the goal of putting this formula on a rigorous basis, and to understand whether these projections can be found efficiently, we study the subset $\mathscr{F}^{\rm alg}_{m,\alpha}\subseteq \mathscr{F}_{m,\alpha}$ of distributions that can be realized by a class of iterative algorithms. We prove that this set is characterized by a certain stochastic optimal control problem, and obtain a dual characterization of this problem in terms of a variational principle that extends Parisi's formula. As a byproduct, we obtain computationally achievable values for a class of random optimization problems including `generalized spherical perceptron' models.
comment: 72 pages, accepted to Annals of Probability
♻ ☆ PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improve correctness in LLMs, however, in many scientific problems, the objective is not necessarily to produce the correct answer, but instead to produce a diverse array of candidates which satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation which improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer which counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50\% greater rate than prior methods and conditionally generates structures with desired space group properties. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
comment: Code available at https://github.com/andaero/PLaID, model weights at https://huggingface.co/HOPE-Lab-HMC/PLaID
♻ ☆ Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation
Layer-wise Gaussian mechanisms (LGM) enhance flexibility in differentially private deep learning by injecting noise into partitioned gradient vectors. However, existing methods often rely on heuristic noise allocation strategies, lacking a rigorous understanding of their theoretical grounding in connecting noise allocation to formal privacy-utility tradeoffs. In this paper, we present a unified analytical framework that systematically connects layer-wise noise injection strategies with their implicit optimization objectives and associated privacy budget allocations. Our analysis reveals that several existing approaches optimize ill-posed objectives -- either ignoring inter-layer signal-to-noise ratio (SNR) consistency or leading to inefficient use of the privacy budget. In response, we propose a SNR-Consistent noise allocation strategy that unifies both aspects, yielding a noise allocation scheme that achieves better signal preservation and more efficient privacy budget utilization. Extensive experiments in both centralized and federated learning settings demonstrate that our method consistently outperforms existing allocation strategies, achieving better privacy-utility tradeoffs. Our framework not only offers diagnostic insights into prior methods but also provides theoretical guidance for designing adaptive and effective noise injection schemes in deep models.
comment: Errors were found in the experimental data preprocessing, which affected the reported results and conclusions. The paper is being revised and a corrected version will be resubmitted
♻ ☆ Conditional Generative Modeling for Enhanced Credit Risk Management in Supply Chain Finance
The rapid expansion of cross-border e-commerce (CBEC) has created significant opportunities for small- and medium-sized sellers, yet financing remains a critical challenge due to their limited credit histories. Third-party logistics (3PL)-led supply chain finance (SCF) has emerged as a promising solution, leveraging in-transit inventory as collateral. We propose an advanced credit risk management framework tailored for 3PL-led SCF, addressing the dual challenges of credit risk assessment and loan size determination. Specifically, we leverage conditional generative modeling of sales distributions through Quantile-Regression-based Generative Metamodeling (QRGMM) as the foundation for risk measures estimation. We propose a unified framework that enables flexible estimation of multiple risk measures while introducing a functional risk measure formulation that systematically captures the relationship between these risk measures and varying loan levels, supported by theoretical guarantees. To capture complex covariate interactions in e-commerce sales data, we integrate QRGMM with Deep Factorization Machines (DeepFM). Extensive experiments on synthetic and real-world data validate the efficacy of our model for credit risk assessment and loan size determination. This study explores the use of generative models in CBEC SCF risk management, illustrating their potential to strengthen credit assessment and support financing for small- and medium-sized sellers.
♻ ☆ Bayesian Ego-graph inference for Networked Multi-Agent Reinforcement Learning NeurIPS 2025
In networked multi-agent reinforcement learning (Networked-MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real-world decentralized systems. We propose a stochastic graph-based policy for Networked-MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end-to-end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision-making strategies. BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.
comment: Accepted at NeurIPS 2025
♻ ☆ Learning More with Less: A Generalizable, Self-Supervised Framework for Privacy-Preserving Capacity Estimation with EV Charging Data
Accurate battery capacity estimation is key to alleviating consumer concerns about battery performance and reliability of electric vehicles (EVs). However, practical data limitations imposed by stringent privacy regulations and labeled data shortages hamper the development of generalizable capacity estimation models that remain robust to real-world data distribution shifts. While self-supervised learning can leverage unlabeled data, existing techniques are not particularly designed to learn effectively from challenging field data -- let alone from privacy-friendly data, which are often less feature-rich and noisier. In this work, we propose a first-of-its-kind capacity estimation model based on self-supervised pre-training, developed on a large-scale dataset of privacy-friendly charging data snippets from real-world EV operations. Our pre-training framework, snippet similarity-weighted masked input reconstruction, is designed to learn rich, generalizable representations even from less feature-rich and fragmented privacy-friendly data. Our key innovation lies in harnessing contrastive learning to first capture high-level similarities among fragmented snippets that otherwise lack meaningful context. With our snippet-wise contrastive learning and subsequent similarity-weighted masked reconstruction, we are able to learn rich representations of both granular charging patterns within individual snippets and high-level associative relationships across different snippets. Bolstered by this rich representation learning, our model consistently outperforms state-of-the-art baselines, achieving 31.9% lower test error than the best-performing benchmark, even under challenging domain-shifted settings affected by both manufacturer and age-induced distribution shifts. Source code is available at https://github.com/en-research/GenEVBattery.
comment: Accepted in IEEE Transactions on Industrial Informatics
♻ ☆ Spatial Supply Repositioning with Censored Demand Data
We consider a network inventory system motivated by one-way, on-demand vehicle sharing services. Under uncertain and correlated network demand, the service operator periodically repositions vehicles to match a fixed supply with spatial customer demand while minimizing costs. Finding an optimal repositioning policy in such a general inventory network is analytically and computationally challenging. We introduce a base-stock repositioning policy as a multidimensional generalization of the classical inventory rule to $n$ locations, and we establish its asymptotic optimality under two practically relevant regimes. We present exact reformulations that enable efficient computation of the best base-stock policy in an offline setting with historical data. In the online setting, we illustrate the challenges of learning with censored data in networked systems through a regret lower bound analysis and by demonstrating the suboptimality of alternative algorithmic approaches. We propose a Surrogate Optimization and Adaptive Repositioning algorithm and prove that it attains an optimal regret of $O(n^{2.5} \sqrt{T})$, which matches the regret lower bound in $T$ with polynomial dependence on $n$. Our work highlights the critical role of inventory repositioning in the viability of shared mobility businesses and illuminates the inherent challenges posed by data and network complexity. Our results demonstrate that simple, interpretable policies, such as the state-independent base-stock policies we analyze, can provide significant practical value and achieve near-optimal performance.
comment: Paper split: part of v1 (mainly Appendix F) is now superseded by another standalone paper
♻ ☆ Recursive Gaussian Process State Space Model
Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. However, for online learning, the field lacks an efficient method suitable for scenarios where prior information regarding data distribution and model function is limited. To address this issue, this paper proposes a recursive GPSSM method with adaptive capabilities for both operating domains and Gaussian process (GP) hyperparameters. Specifically, we first utilize first-order linearization to derive a Bayesian update equation for the joint distribution between the system state and the GP model, enabling closed-form and domain-independent learning. Second, an online selection algorithm for inducing points is developed based on informative criteria to achieve lightweight learning. Third, to support online hyperparameter optimization, we recover historical measurement information from the current filtering distribution. Comprehensive evaluations on both synthetic and real-world datasets demonstrate the superior accuracy, computational efficiency, and adaptability of our method compared to state-of-the-art online GPSSM techniques.
♻ ☆ Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems
Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. However, the topology of these systems--how agents in MASs should be configured, connected, and coordinated--remains largely unexplored. In this position paper, we call for a paradigm shift toward \emph{topology-aware MASs} that explicitly model and dynamically optimize the structure of inter-agent interactions. We identify three fundamental components--agents, communication links, and overall topology--that collectively determine the system's adaptability, efficiency, robustness, and fairness. To operationalize this vision, we introduce a systematic three-stage framework: 1) agent selection, 2) structure profiling, and 3) topology synthesis. This framework not only provides a principled foundation for designing MASs but also opens new research frontiers across language modeling, reinforcement learning, graph learning, and generative modeling to ultimately unleash their full potential in complex real-world applications. We conclude by outlining key challenges and opportunities in MASs evaluation. We hope our framework and perspectives offer critical new insights in the era of agentic AI.
♻ ☆ MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions, limiting exploration of high-reward states in the target domain that do not fall within these regions. Consequently, such methods often fail when the dynamics shift is significant or the optimal trajectories lie outside the low-shift regions. To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using learned target dynamics transitions to explore the target domain, rather than only being trained with the low dynamics-shift transitions. For the dynamics learning, built on the observation that achieving the same next state requires taking different actions in different domains, MOBODY employs separate action encoders for each domain to encode different actions to the shared latent space while sharing a unified representation of states and a common transition function. We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions, which push the policy toward actions with high target-domain Q-values, rather than high source domain Q-values or uniformly imitating all actions in the offline dataset. We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.
♻ ☆ Recurrent Natural Policy Gradient for POMDPs
Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on sample and iteration complexity to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.
♻ ☆ Privacy-Preserving Dataset Combination AAAI
Access to diverse, high-quality datasets is crucial for machine learning model performance, yet data sharing remains limited by privacy concerns and competitive interests, particularly in regulated domains like healthcare. This dynamic especially disadvantages smaller organizations that lack resources to purchase data or negotiate favorable sharing agreements, due to the inability to \emph{privately} assess external data's utility. To resolve privacy and uncertainty tensions simultaneously, we introduce {\SecureKL}, the first secure protocol for dataset-to-dataset evaluations with zero privacy leakage, designed to be applied preceding data sharing. {\SecureKL} evaluates a source dataset against candidates, performing dataset divergence metrics internally with private computations, all without assuming downstream models. On real-world data, {\SecureKL} achieves high consistency ($>90\%$ correlation with non-private counterparts) and successfully identifies beneficial data collaborations in highly-heterogeneous domains (ICU mortality prediction across hospitals and income prediction across states). Our results highlight that secure computation maximizes data utilization, outperforming privacy-agnostic utility assessments that leak information.
comment: 15 pages, 9 figures, AAAI-AIES 25'
♻ ☆ WebInject: Prompt Injection Attack to Web Agents
Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. In this work, we propose WebInject, a prompt injection attack that manipulates the webpage environment to induce a web agent to perform an attacker-specified action. Our attack adds a perturbation to the raw pixel values of the rendered webpage. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the attacker-specified action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple datasets shows that WebInject is highly effective and significantly outperforms baselines.
comment: Appeared in EMNLP 2025 main conference. To better understand prompt injection attacks, see https://people.duke.edu/~zg70/code/PromptInjection.pdf
♻ ☆ Bidirectional Representations Augmented Autoregressive Biological Sequence Generation:Application in De Novo Peptide Sequencing NeurIPS 2025
Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM-Labs/denovo.
comment: Accepted by NeurIPS 2025
♻ ☆ SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling
Mixture-of-Experts (MoE) models have become a widely-adopted solution to continue scaling model sizes without a corresponding linear increase in compute. During MoE model training, each input token is dynamically routed to a subset of experts -- sparsely-activated feed-forward networks -- within each transformer layer. The distribution of tokens assigned to each expert varies widely and rapidly over the course of training. To handle the wide load imbalance across experts, current systems are forced to either drop tokens assigned to popular experts, degrading convergence, or frequently rebalance resources allocated to each expert based on popularity, incurring high state migration overheads. To break this performance-accuracy tradeoff, we introduce SYMI, an adaptive MoE training system. The key insight of SYMI is to decouple the placement of expert parameters from their large optimizer state. SYMI statically partitions the optimizer of each expert across all training nodes. Meanwhile, SYMI dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads. In doing so, SYMI right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overhead. Compared to state-of-the-art MoE training systems, DeepSpeed and FlexMoE, SYMI is able to achieve a 30.5% and 25.9% faster time-to-convergence, respectively.
♻ ☆ D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
comment: Corrected typos
♻ ☆ Hyperbolic Dataset Distillation NeurIPS 2025
To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. To the best of our knowledge, this is the first work to incorporate the hyperbolic space into the dataset distillation process. The code is available at https://github.com/Guang000/HDD.
comment: Accepted to NeurIPS 2025
♻ ☆ LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration
Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present $\textbf{LayerCraft}$, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) $\textit{structured generation}$ from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) $\textit{layered object integration}$, allowing users to insert and customize objects -- such as characters or props -- across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the $\textbf{ChainArchitect}$ for CoT-driven layout planning, and the $\textbf{Object Integration Network (OIN)}$ for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.
comment: 26 pages
♻ ☆ Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to an audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can result in unrealistic configurations or biased outcomes. We address this pitfall by introducing a Gaussian prior derived from the DiffVox vocal preset dataset over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces the parameter mean squared error by up to 33% and more closely matches the reference style. Subjective evaluations with 16 participants confirm the superiority of our method in limited data regimes. This work demonstrates how incorporating prior knowledge at inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
comment: Published at WASPAA 2025
Computer Vision and Pattern Recognition 219
☆ Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
comment: Project page: https://coupled-diffusion.github.io
☆ From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
comment: 21 pages, 7 figures
☆ Agentic Design of Compositional Machines
The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
comment: 75 pages, 31 figures, Project Page: https://besiegefield.github.io
☆ Learning an Image Editing Model without Image Editing Pairs
Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
comment: project page: https://nupurkmr9.github.io/npedit/
☆ Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation ICCV 2025
Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.
comment: Accepted to ICCV 2025. Project page: https://stevenlsw.github.io/ponimator/
☆ Terra: Explorable Native 3D World Model with Point Latents
World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
comment: Project Page: https://huang-yh.github.io/terra/
☆ WithAnyone: Towards Controllable and ID Consistent Image Generation
Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
comment: 23 Pages; Project Page: https://doby-xu.github.io/WithAnyone/; Code: https://github.com/Doby-Xu/WithAnyone
☆ pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
comment: Code: https://github.com/Lakonik/piFlow Demos: https://huggingface.co/spaces/Lakonik/pi-Qwen and https://huggingface.co/spaces/Lakonik/pi-FLUX.1
☆ RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks NeurIPS 2025
To tackle long-horizon tasks, recent hierarchical vision-language-action (VLAs) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic subtasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025); Project Website: rdd-neurips.github.io
☆ ChangingGrounding: 3D Visual Grounding in Changing Scenes
Real-world robots localize objects from natural-language instructions while scenes around them keep changing. Yet most of the existing 3D visual grounding (3DVG) method still assumes a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/ .
comment: 30 pages
☆ RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion
Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the spatio-temporal encoder that dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.
☆ C4D: 4D Made from 3D through Dual Correspondences ICCV 2025
Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D
comment: Accepted to ICCV 2025
☆ MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit-framework, datasets, and benchmark-to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/
comment: Project Page: https://mathcanvas.github.io/
☆ RealDPO: Real or Not Real, that is the Preference
Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
comment: Code:https://github.com/Vchitect/RealDPO Project Page:https://vchitect.github.io/RealDPO-Project/
☆ OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression
Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
☆ From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance
Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision-language-action humanoid systems.
☆ DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
☆ 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page : https://cvlab-kaist.github.io/3DScenePrompt/
comment: Project page : https://cvlab-kaist.github.io/3DScenePrompt/
☆ MaskCaptioner : Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
☆ Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection
Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.
☆ You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction WACV26
Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
comment: Accepted to WACV26. 12 pages, 8 tables, 5 figures
☆ ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention
Text-to-image generation with visual autoregressive~(VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and equipping a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
☆ BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data
Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar's real-world dashcam collision dataset -- the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.
☆ TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is $\href{https://guangyid.github.io/hoi123touch}{here}$.
☆ Benchmarking Multimodal Large Language Models for Face Recognition
Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available in the project page.
☆ Multi-modal video data-pipelines for machine learning with minimal human supervision
The real-world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost. Similarly for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e. rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together, however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model which was efficiently distilled into a low-parameter (<1M) can have competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.
☆ A Multi-Task Deep Learning Framework for Skin Lesion Classification, ABCDE Feature Quantification, and Evolution Simulation
Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and also quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening more windows for future exploration. The A, B, C, and D values are quantified particularly within this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevuses to malignant melanoma. The experiments are conducted using the HAM10000 dataset that contains around ten thousand images of skin lesions of varying stages. In summary, the classification worked with an accuracy of around 89 percent, with melanoma AUC being 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.
☆ ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
☆ Backdoor Unlearning by Linear Task Decomposition
Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
☆ QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models
Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
☆ Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data
AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0--a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation--based on lessons from the JHH dataset--for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
☆ Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts ''yes'' (resp. ''no'') token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
☆ FraQAT: Quantization Aware Training with Fractional bits
State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, \eg, in \INT{8}. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality in previous aggressive quantization, we propose a new fractional bits quantization (\short) approach. The novelty is a simple yet effective idea: we progressively reduce the model's precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that the \short{} yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, \pixart, and FLUX.1-schnell, while achieving $4-7\%$ lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
☆ Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning
Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment \textbf{P}erception and explicit \textbf{R}oute choice modeling for effective \textbf{Traj}ectory representation learning, dubbed \textbf{PRTraj}. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.
☆ Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks
Early tumor detection save lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks--detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor's size, number, appearance, and sometimes, pathology results--information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super
☆ Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images
Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task agnostic methodologies that can overlook organ-specific crucial morphological patterns that represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity and reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 +- 0.04; accuracy = 68.37% +- 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p < 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.
☆ CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.
comment: 28 pages, 13 Figures, 12 Tables
☆ MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks
Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments, where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols (``start'', ``end'', ``1'', ``0''). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework's effectiveness, which demonstrates accurate decoding and low power consumption, and highlights its potential as an energy-efficient alternative for MAV communication in constrained environments.
☆ Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality
Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make the Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth's comprehensive datasets enables the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and Navier-Stokes algorithm, on an evaluation set of 1000 samples. Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.
comment: 21 pages, 9 figures
LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement
Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifact. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.
☆ DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models NeurIPS 2025
Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks-activation maximization, slice discovery and debiasing, and bias explanation-each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.
comment: Accepted to NeurIPS 2025 (spotlight)
☆ Free-Grained Hierarchical Recognition
Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.
comment: 26 pages
☆ Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection
Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.
☆ Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models
Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.
comment: 5 pages, accepted at AIROV2025
☆ Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery
Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. "interesting points". We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% -- from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at https://github.com/microsoft/whales.
☆ Leveraging Learned Image Prior for 3D Gaussian Compression ICCV 2025
Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.
comment: Accepted to ICCV 2025 Workshop on ECLR
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning ICCV 2025
In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io
comment: Accepted by ICCV 2025
☆ WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
☆ EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024)
Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. We release the codes and datasets by aligning with FAIR and the open science paradigm at https://github.com/EricYu97/EuroMineNet.
☆ Decorrelation Speeds Up Vision Transformers ICLR 2026
Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
comment: 15 pages, 12 figures, submitted to ICLR 2026
☆ In-Context Learning with Unpaired Clips for Instruction-based Video Editing
Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12\% improvement in editing instruction following and a 15\% improvement in editing quality.
☆ SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation
Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve robustness for classification to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-label. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.
☆ Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
comment: Code: https://github.com/CompVis/RepTok
☆ GOPLA: Generalizable Object Placement Learning via Synthetic Augmentation of Human Arrangement
Robots are expected to serve as intelligent assistants, helping humans with everyday household organization. A central challenge in this setting is the task of object placement, which requires reasoning about both semantic preferences (e.g., common-sense object relations) and geometric feasibility (e.g., collision avoidance). We present GOPLA, a hierarchical framework that learns generalizable object placement from augmented human demonstrations. A multi-modal large language model translates human instructions and visual inputs into structured plans that specify pairwise object relationships. These plans are then converted into 3D affordance maps with geometric common sense by a spatial mapper, while a diffusion-based planner generates placement poses guided by test-time costs, considering multi-plan distributions and collision avoidance. To overcome data scarcity, we introduce a scalable pipeline that expands human placement demonstrations into diverse synthetic training data. Extensive experiments show that our approach improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility, demonstrating strong generalization across a wide range of real-world robotic placement scenarios.
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity, requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
☆ Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding ACM MM
Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose \textbf{Shot2Tactic-Caption}, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.
comment: 9 pages, 3 figures. Accepted to ACM MMSports 2025
☆ Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering NeurIPS 2025
Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via a reinforcement learning manner. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements~(36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
comment: Accepted by NeurIPS 2025
☆ Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering
Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.
comment: Extended abstract. Submitted to AICC: Workshop on AI for Climate and Conservation - EurIPS 2025 (non-archival)
☆ Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers
State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from "blank" and "animal" labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent
comment: Extended abstract. Submitted to AICC: Workshop on AI for Climate and Conservation - EurIPS 2025 (non-archival)
☆ STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding
Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB \(+\) auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
comment: Code, model, and demos can be found at https://envision-research.github.io/STANCE/
☆ Talking Points: Describing and Localizing Pixels
Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close is the predicted point generated to the ground truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart.The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.
☆ CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification
This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97\% points compared with the strongest baseline in our study. The results confirms the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.
comment: 10 pages, 7 figures
☆ BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU
3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) Skewed density allocation during Gaussian densification, (2) Imbalanced computation workload during Gaussian projection and (3) Fragmented memory access during color splatting. To tackle the above challenges, we introduce BalanceGS, the algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions - removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose Similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution - threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that compared with 3DGS, our approach achieves a 1.44$\times$ training speedup on a NVIDIA A100 GPU with negligible quality degradation.
comment: Accepted by ASP-DAC 2026
☆ Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video NeurIPS 2025
Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric-a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page:https://zhangyl4.github.io/publications/eyes-wide-open/
comment: Accepted at NeurIPS 2025 (preview; camera-ready in preparation)
☆ Consistent text-to-image generation via scene de-contextualization
Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.
☆ Exploring Cross-Modal Flows for Few-Shot Learning
Aligning features from different modalities, is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
comment: 13 pages, 6 figures
☆ Exploring Image Representation with Decoupled Classical Visual Descriptors BMVC 2025
Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: https://chenyuanqu.com/VisualSplit/.
comment: Accepted by The 36th British Machine Vision Conference (BMVC 2025)
☆ Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval
Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability$-$an essential requirement in medical applications$-$leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder to reconstruct the image $f_D$, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.
comment: 6 pages,3 figures, 3 tables. Accepted at 2025 IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2025)
☆ Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology
Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.
☆ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
☆ Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models
In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
comment: Appendix will be appended soon
☆ Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing
Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.
☆ Vision Mamba for Permeability Prediction of Porous Media
Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus, they can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.
☆ Grazing Detection using Deep Learning and Sentinel-2 Time Series Data
Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.
comment: Code and models: https://github.com/aleksispi/pib-ml-grazing
☆ Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration
Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model's optimization, effectively uncovering "winning tickets" that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.
comment: Accepted at WI-IAT 2025
☆ Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review
Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 - 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.
☆ Structured Universal Adversarial Attacks on Object Detection for Video Sequences
Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.
comment: Accepted at GCPR 2025 (German Conference on Pattern Recognition). This is a different version as submitted to the conference, not the official conference proceedings
☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7\% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
comment: 10 pages
☆ Deep Compositional Phase Diffusion for Long Motion Sequence Generation NeurIPS 2025
Recent research on motion generation has shown significant progress in generating semantically aligned motion with singular semantics. However, when employing these models to create composite sequences containing multiple semantically generated motion clips, they often struggle to preserve the continuity of motion dynamics at the transition boundaries between clips, resulting in awkward transitions and abrupt artifacts. To address these challenges, we present Compositional Phase Diffusion, which leverages the Semantic Phase Diffusion Module (SPDM) and Transitional Phase Diffusion Module (TPDM) to progressively incorporate semantic guidance and phase details from adjacent motion clips into the diffusion process. Specifically, SPDM and TPDM operate within the latent motion frequency domain established by the pre-trained Action-Centric Motion Phase Autoencoder (ACT-PAE). This allows them to learn semantically important and transition-aware phase information from variable-length motion clips during training. Experimental results demonstrate the competitive performance of our proposed framework in generating compositional motion sequences that align semantically with the input conditions, while preserving phase transitional continuity between preceding and succeeding motion clips. Additionally, motion inbetweening task is made possible by keeping the phase parameter of the input motion sequences fixed throughout the diffusion process, showcasing the potential for extending the proposed framework to accommodate various application scenarios. Codes are available at https://github.com/asdryau/TransPhase.
comment: Accepted by NeurIPS 2025 (Oral)
☆ DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis
The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic modes for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All codes have been made publicly accessible at https://github.com/tuuuc/DCMIL.
☆ BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble WACV 2026
Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards inspection remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors - YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.
comment: This paper has been submitted to IEEE/CVF WACV 2026 Applications track and is currently under review
☆ DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights
Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20\% test set used by recent methods, our model achieves Dice improvements of 0.10\% for whole tumor, 1.75\% for tumor core, and 0.93\% for enhancing tumor. Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86\% for tumor core and 1.45\% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.
☆ DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.
☆ Spatial Preference Rewarding for MLLMs Spatial Understanding ICCV 2025
Multimodal large language models~(MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue by SPR, a Spatial Preference Rewarding~(SPR) approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
comment: ICCV 2025
☆ AI for Service: Proactive Assistance with AI Glasses
In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.
comment: 24 pages, 5 figures, work in progress
☆ Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration ICRA 2024
With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration meth- ods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self- supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.
comment: 8 pages, accepted at ICRA 2024 (International Conference on Robotics and Automation)
☆ Vision-Centric Activation and Coordination for Multimodal Large Language Models
Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To track this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
comment: Under Review
☆ A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities
Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.
☆ A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection
An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
☆ Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding
Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
comment: EMNLP 2025 Findings; Project: https://github.com/KR-0822/TCD
☆ Learning Human-Humanoid Coordination for Collaborative Object Carrying
Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7%. compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
☆ CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts
Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.
☆ GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
☆ Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment
We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} < 10^{-3}$ at 200m-60kbps and 400m-30kbps in outdoor experiments.
☆ MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching
Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at https://github.com/TingmanYan/MatchAttention.
☆ Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
☆ Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at \href{https://ipro-alimama.github.io/}{https://ipro-alimama.github.io/}.
☆ MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering
Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MOE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furtheremore, we present Auxiliary-Loss-Free Load Balancing(ALF-LB) strategy to enhance the localization accuracy on large-scale scene. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
comment: 8 pages
☆ Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication
Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC system utilizing an event-based vision sensor (EVS) as receivers have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied, however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
☆ Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation
Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.
comment: 10 pages, submitted to IEEE TMI
☆ PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis
The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis(PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA
☆ LOTA: Bit-Planes Guided AI-Generated Image Detection ICCV2025
The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of \textbf{98.9\%} (\textbf{11.9}\%~$\uparrow$) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2\% from GAN to Diffusion and over 99.2\% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at https://github.com/hongsong-wang/LOTA.
comment: Published in the ICCV2025, COde is https://github.com/hongsong-wang/LOTA
☆ Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition
This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.
comment: Accepted at APSIPA ASC 2025
☆ Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures SIGGRAPH
We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.
comment: Accepted to SIGGRAPH Asia 2025
☆ Salient Concept-Aware Generative Data Augmentation NeurIPS2025
Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance between fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with averaged classification accuracy improvements by 0.73% and 6.5% under conventional and long-tail settings, respectively.
comment: 10 pages, 4 figures, NeurIPS2025
☆ Hyperparameter Optimization and Reproducibility in Deep Learning Model Training
Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.
☆ Train a Unified Multimodal Data Quality Classifier with Synthetic Data
The Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while the high-quality data filtering towards image-text interleaved document data is under-explored. We propose to train an efficient MLLM as a Unified Mulitmodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
comment: EMNLP 2025 Findings
☆ XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
Fourier Transform Multiple Instance Learning for Whole Slide Image Classification
Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
☆ Deep generative priors for 3D brain analysis
Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
☆ TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
☆ SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images
Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To our best knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views in over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes the redundant 3DGS in a single feed-forward pass. Experiments on multiple datasets demonstrate our state-of-the-art performance on both novel view synthesis and depth estimation, demonstrating superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: https://wrld.github.io/SaLon3R/.
☆ A solution to generalized learning from small training sets found in everyday infant experiences
Young children readily recognize and generalize visual objects labeled by common nouns, suggesting that these basic level object categories may be given. Yet if they are, how they arise remains unclear. We propose that the answer lies in the statistics of infant daily life visual experiences. Whereas large and diverse datasets typically support robust learning and generalization in human and machine learning, infants achieve this generalization from limited experiences. We suggest that the resolution of this apparent contradiction lies in the visual diversity of daily life, repeated experiences with single object instances. Analyzing egocentric images from 14 infants (aged 7 to 11 months) we show that their everyday visual input exhibits a lumpy similarity structure, with clusters of highly similar images interspersed with rarer, more variable ones, across eight early-learned categories. Computational experiments show that mimicking this structure in machines improves generalization from small datasets in machine learning. The natural lumpiness of infant experience may thus support early category learning and generalization and, more broadly, offer principles for efficient learning across a variety of problems and kinds of learners.
comment: 24 pages, 10 figures, 1 table
☆ Directional Reasoning Injection for Fine-Tuning MLLMs
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
comment: Project Page: https://wikichao.github.io/DRIFT/
☆ Comprehensive language-image pre-training for 3D medical image understanding
Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.
☆ Generalized Dynamics Generation towards Scannable Physical World Model
Digital twin worlds with realistic interactive dynamics presents a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry-agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry-agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.
☆ Composition-Grounded Instruction Synthesis for Visual Reasoning
Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
☆ MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning ICCV 2025
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
comment: ICCV 2025
☆ LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models
Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like Civit.ai, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
☆ Constantly Improving Image Models Need Constantly Improving Benchmarks
Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at https://echo-bench.github.io.
☆ NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks
3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing datasets Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page:https://jamesyjl.github.io/Nano3D
comment: Project Page: https://jamesyjl.github.io/Nano3D
☆ UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer comparing to prior methods, accomplishing a 300 m real-world mission with only two interventions.
comment: Technical report. Project page: https://urbanverseproject.github.io/
☆ DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
♻ ☆ Reasoning in Space via Grounding in the World
In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
comment: 20 pages, 7 figures
♻ ☆ The Mechanistic Emergence of Symbol Grounding in Language Models
Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.
♻ ☆ NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
♻ ☆ OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild NeurIPS 2025
Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
comment: Accepted to NeurIPS 2025; Project page: https://github.com/quhongyu/OmniGaze
♻ ☆ MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
♻ ☆ High Semantic Features for the Continual Learning of Complex Emotions: a Lightweight Solution
Incremental learning is a complex process due to potential catastrophic forgetting of old tasks when learning new ones. This is mainly due to transient features that do not fit from task to task. In this paper, we focus on complex emotion recognition. First, we learn basic emotions and then, incrementally, like humans, complex emotions. We show that Action Units, describing facial muscle movements, are non-transient, highly semantical features that outperform those extracted by both shallow and deep convolutional neural networks. Thanks to this ability, our approach achieves interesting results when learning incrementally complex, compound emotions with an accuracy of 0.75 on the CFEE dataset and can be favorably compared to state-of-the-art results. Moreover, it results in a lightweight model with a small memory footprint.
comment: 10 pages, 14 figures
♻ ☆ Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation ICCV 2025
Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from initial Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), and Multi-Level Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a perspective-aware teaching (PAT) KD framework to enable feature distillation across diverse architectures. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model's learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architectures. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method. Our code is available at https://github.com/jimmylin0979/PAT.git.
comment: Accepted by ICCV 2025
♻ ☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
♻ ☆ Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models
Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. And the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.
♻ ☆ CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation NeurIPS 2025
Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff
comment: Accepted by NeurIPS 2025
♻ ☆ SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state of the art vision language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.
♻ ☆ SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.
comment: Project page: https://research.nvidia.com/labs/sil/projects/simuli
♻ ☆ Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce VisualToolBench, a visual tool-use reasoning benchmark that rigorously evaluates MLLMs' ability to perceive, transform, and reason across complex visual-textual tasks under the think-with-images paradigm. VisualToolBench comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, VisualToolBench offers critical insights for advancing visual intelligence in MLLMs.
♻ ☆ TTT3R: 3D Reconstruction as Test-Time Training
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available in https://rover-xingyu.github.io/TTT3R
comment: Page: https://rover-xingyu.github.io/TTT3R/ Code: https://github.com/Inception3D/TTT3R
♻ ☆ LinPrim: Linear Primitives for Differentiable Volumetric Rendering NeurIPS 2025
Volumetric rendering has become central to modern novel view synthesis methods, which use differentiable rendering to optimize 3D scene representations directly from observed views. While many recent works build on NeRF or 3D Gaussians, we explore an alternative volumetric scene representation. More specifically, we introduce two new scene representations based on linear primitives - octahedra and tetrahedra - both of which define homogeneous volumes bounded by triangular faces. To optimize these primitives, we present a differentiable rasterizer that runs efficiently on GPUs, allowing end-to-end gradient-based optimization while maintaining real-time rendering capabilities. Through experiments on real-world datasets, we demonstrate comparable performance to state-of-the-art volumetric methods while requiring fewer primitives to achieve similar reconstruction fidelity. Our findings deepen the understanding of 3D representations by providing insights into the fidelity and performance characteristics of transparent polyhedra and suggest that adopting novel primitives can expand the available design space.
comment: Project page: https://nicolasvonluetzow.github.io/LinPrim - Project video: https://youtu.be/NRRlmFZj5KQ - Accepted at NeurIPS 2025
♻ ☆ TABSurfer: a Hybrid Deep Learning Architecture for Subcortical Segmentation
Subcortical segmentation remains challenging despite its important applications in quantitative structural analysis of brain MRI scans. The most accurate method, manual segmentation, is highly labor intensive, so automated tools like FreeSurfer have been adopted to handle this task. However, these traditional pipelines are slow and inefficient for processing large datasets. In this study, we propose TABSurfer, a novel 3D patch-based CNN-Transformer hybrid deep learning model designed for superior subcortical segmentation compared to existing state-of-the-art tools. To evaluate, we first demonstrate TABSurfer's consistent performance across various T1w MRI datasets with significantly shorter processing times compared to FreeSurfer. Then, we validate against manual segmentations, where TABSurfer outperforms FreeSurfer based on the manual ground truth. In each test, we also establish TABSurfer's advantage over a leading deep learning benchmark, FastSurferVINN. Together, these studies highlight TABSurfer's utility as a powerful tool for fully automated subcortical segmentation with high fidelity.
comment: 5 pages, 3 figures, 2 tables
♻ ☆ Ctrl-VI: Controllable Video Synthesis via Variational Inference
Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop Ctrl-VI, a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.
comment: Project page: https://video-synthesis-variational.github.io/
♻ ☆ Shape of Motion: 4D Reconstruction from a Single Video ICCV 2025
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. We tackle the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we take advantage of off-the-shelf data-driven priors such as monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/
comment: ICCV 2025
♻ ☆ GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models ICCV 2025
Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is predominantly synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 3284 questions, covering five tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.0%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB and a lightweight GRAB-Lite to encourage progress in this important, growing domain.
comment: Accepted at ICCV 2025
♻ ☆ AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective
Talking-head animation focuses on generating realistic facial videos from audio input. Following Generative Adversarial Networks (GANs), diffusion models have become the mainstream, owing to their robust generative capacities. However, inherent limitations of the diffusion process often lead to inter-frame flicker and slow inference, restricting their practical deployment. To address this, we introduce AvatarSync, an autoregressive framework on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. To mitigate flicker and ensure continuity, AvatarSync leverages an autoregressive pipeline that enhances temporal modeling. In addition, to ensure controllability, we introduce phonemes, which are the basic units of speech sounds, and construct a many-to-one mapping from text/audio to phonemes, enabling precise phoneme-to-visual alignment. Additionally, to further accelerate inference, we adopt a two-stage generation strategy that decouples semantic modeling from visual dynamics, and incorporate a customized Phoneme-Frame Causal Attention Mask to support multi-step parallel acceleration. Extensive experiments conducted on both Chinese (CMLR) and English (HDTF) datasets demonstrate that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.
♻ ☆ Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.
♻ ☆ Deep Few-view High-resolution Photon-counting CT at Halved Dose for Extremity Imaging
X-ray photon-counting computed tomography (PCCT) for extremity allows multi-energy high-resolution (HR) imaging but its radiation dose can be further improved. Despite the great potential of deep learning techniques, their application in HR volumetric PCCT reconstruction has been challenged by the large memory burden, training data scarcity, and domain gap issues. In this paper, we propose a deep learning-based approach for PCCT image reconstruction at halved dose and doubled speed validated in a New Zealand clinical trial. Specifically, we design a patch-based volumetric refinement network to alleviate the GPU memory limitation, train network with synthetic data, and use model-based iterative refinement to bridge the gap between synthetic and clinical data. Our results in a reader study of 8 patients from the clinical trial demonstrate a great potential to cut the radiation dose to half that of the clinical PCCT standard without compromising image quality and diagnostic value.
comment: Accepted for publication in IEEE TMI, https://ieeexplore.ieee.org/document/11200164; IEEE Transactions on Medical Imaging
♻ ☆ SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment
Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.
comment: Project Page: https://singhbino3d.github.io/sgpp/
♻ ☆ The Principle of Uncertain Maximum Entropy
The Principle of Maximum Entropy is a rigorous technique for estimating an unknown distribution given partial information while simultaneously minimizing bias. However, an important requirement for applying the principle is that the available information be provided error-free (Jaynes 1982). We relax this requirement using a memoryless communication channel as a framework to derive a new, more general principle. We show our new principle provides an upper bound on the entropy of the unknown distribution and the amount of information lost due to the use of a given communications channel is unknown unless the unknown distribution's entropy is also known. Using our new principle we provide a new interpretation of the classic principle and experimentally show its performance relative to the classic principle and other generally applicable solutions. Finally, we present a simple algorithm for solving our new principle and an approximation useful when samples are limited.
♻ ☆ Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work provides an initial step toward building more historically accurate TTI models.
♻ ☆ Pruning Sparse Tensor Neural Networks Enables Deep Learning for 3D Ultrasound Localization Microscopy
Ultrasound Localization Microscopy (ULM) is a non-invasive technique that allows for the imaging of micro-vessels in vivo, at depth and with a resolution on the order of ten microns. ULM is based on the sub-resolution localization of individual microbubbles injected in the bloodstream. Mapping the whole angioarchitecture requires the accumulation of microbubbles trajectories from thousands of frames, typically acquired over a few minutes. ULM acquisition times can be reduced by increasing the microbubble concentration, but requires more advanced algorithms to detect them individually. Several deep learning approaches have been proposed for this task, but they remain limited to 2D imaging, in part due to the associated large memory requirements. Herein, we propose to use sparse tensor neural networks to reduce memory usage in 2D and to improve the scaling of the memory requirement for the extension of deep learning architecture to 3D. We study several approaches to efficiently convert ultrasound data into a sparse format and study the impact of the associated loss of information. When applied in 2D, the sparse formulation reduces the memory requirements by a factor 2 at the cost of a small reduction of performance when compared against dense networks. In 3D, the proposed approach reduces memory requirements by two order of magnitude while largely outperforming conventional ULM in high concentration settings. We show that Sparse Tensor Neural Networks in 3D ULM allow for the same benefits as dense deep learning based method in 2D ULM i.e. the use of higher concentration in silico and reduced acquisition time.
♻ ☆ Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos ICCV 2025
Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
comment: ICCV 2025. Project page: https://chenyi99.github.io/moto/
♻ ☆ Subspace-Boosted Model Merging
Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, the benefits of merging an increasing amount of specialized experts generally lead to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective; revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on both vision and language benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to quantify task similarity, offering a new interpretable perspective on model merging.
comment: 22 pages (main + supp)
♻ ☆ Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism-mixing softmax and linear tokens-with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40\% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks. Project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.
♻ ☆ WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving
Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel view videos rendered from these Gaussians using a enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos. Project: https://wm-research.github.io/worldsplat/
♻ ☆ Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification
Lightweight convolutional and transformer-based networks are increasingly preferred for real-time image classification, especially on resource-constrained devices. This study evaluates the impact of hyperparameter optimization on the accuracy and deployment feasibility of seven modern lightweight architectures: ConvNeXt-T, EfficientNetV2-S, MobileNetV3-L, MobileViT v2 (S/XS), RepVGG-A2, and TinyViT-21M, trained on a class-balanced subset of 90,000 images from ImageNet-1K. Under standardized training settings, this paper investigates the influence of learning rate schedules, augmentation, optimizers, and initialization on model performance. Inference benchmarks are performed using an NVIDIA L40s GPU with batch sizes ranging from 1 to 512, capturing latency and throughput in real-time conditions. This work demonstrates that controlled hyperparameter variation significantly alters convergence dynamics in lightweight CNN and transformer backbones, providing insight into stability regions and deployment feasibility in edge artificial intelligence. Our results reveal that tuning alone leads to a top-1 accuracy improvement of 1.5 to 3.5 percent over baselines, and select models (e.g., RepVGG-A2, MobileNetV3-L) deliver latency under 5 milliseconds and over 9,800 frames per second, making them ideal for edge deployment. This work provides reproducible, subset-based insights into lightweight hyperparameter tuning and its role in balancing speed and accuracy. The code and logs may be seen at: https://vineetkumarrakesh.github.io/lcnn-opt
♻ ☆ SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions NeurIPS 2025
The whole is greater than the sum of its parts-even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.
comment: to appear in NeurIPS 2025
♻ ☆ Online Continual Learning via Spiking Neural Networks with Sleep Enhanced Latent Replay
Edge computing scenarios necessitate the development of hardware-efficient online continual learning algorithms to be adaptive to dynamic environment. However, existing algorithms always suffer from high memory overhead and bias towards recently trained tasks. To tackle these issues, this paper proposes a novel online continual learning approach termed as SESLR, which incorporates a sleep enhanced latent replay scheme with spiking neural networks (SNNs). SESLR leverages SNNs' binary spike characteristics to store replay features in single bits, significantly reducing memory overhead. Furthermore, inspired by biological sleep-wake cycles, SESLR introduces a noise-enhanced sleep phase where the model exclusively trains on replay samples with controlled noise injection, effectively mitigating classification bias towards new classes. Extensive experiments on both conventional (MNIST, CIFAR10) and neuromorphic (NMNIST, CIFAR10-DVS) datasets demonstrate SESLR's effectiveness. On Split CIFAR10, SESLR achieves nearly 30% improvement in average accuracy with only one-third of the memory consumption compared to baseline methods. On Split CIFAR10-DVS, it improves accuracy by approximately 10% while reducing memory overhead by a factor of 32. These results validate SESLR as a promising solution for online continual learning in resource-constrained edge computing scenarios.
comment: There are some errors in the data of the paper. It will take some time to conduct additional experiments for adjustment. We hope to temporarily withdraw the preprint to prevent misleading others
♻ ☆ Multi-level Reliable Guidance for Unpaired Multi-view Clustering
In this thesis, we address the challenging problem of unpaired multi-view clustering (UMC), which aims to achieve effective joint clustering using unpaired samples observed across multiple views. Traditional incomplete multi-view clustering (IMC) methods typically rely on paired samples to capture complementary information between views. However, such strategies become impractical in the UMC due to the absence of paired samples. Although some researchers have attempted to address this issue by preserving consistent cluster structures across views, effectively mining such consistency remains challenging when the cluster structures {with low confidence}. Therefore, we propose a novel method, Multi-level Reliable Guidance for UMC (MRG-UMC), which integrates multi-level clustering and reliable view guidance to learn consistent and confident cluster structures from three perspectives. Specifically, inner-view multi-level clustering exploits high-confidence sample pairs across different levels to reduce the impact of boundary samples, resulting in more confident cluster structures. Synthesized-view alignment leverages a synthesized-view to mitigate cross-view discrepancies and promote consistency. Cross-view guidance employs a reliable view guidance strategy to enhance the clustering confidence of poorly clustered views. These three modules are jointly optimized across multiple levels to achieve consistent and confident cluster structures. Furthermore, theoretical analyses verify the effectiveness of MRG-UMC in enhancing clustering confidence. Extensive experimental results show that MRG-UMC outperforms state-of-the-art UMC methods, achieving an average NMI improvement of 12.95\% on multi-view datasets. {The source code is available at: https://anonymous.4open.science/r/MRG-UMC-5E20.
♻ ☆ EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture
In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model's ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike traditional encoder-decoder framework, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable demonstrated through sequential attention maps, illustrating the refined, layer-by-layer focus on key image features. Experiments on ImageNet-1k and ImageNet-21k, along with transfer learning tasks, show that EDIT achieves consistent performance improvements over DeiT3 models. These results highlight the effectiveness of EDIT's design in addressing attention sink and improving visual feature extraction.
♻ ☆ Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
comment: Preprint
♻ ☆ 3DOT: Texture Transfer for 3DGS Objects from a Single Reference Image
3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DSwapping, a 3D texture swapping method that integrates: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, our progressive generation process starts by editing a single reference image and gradually propagates the edits to adjacent views. Our view-consistency gradient guidance further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, we introduce prompt-tuning-based gradient guidance, which learns a token that precisely captures the difference between the reference image and the 3D object. This token then guides the editing process, ensuring more consistent texture preservation across views. Overall, 3DSwapping integrates these novel strategies to achieve higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that our three novel components enable convincing and effective 2D texture swapping for 3D objects. Code will be available upon acceptance.
♻ ☆ DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces NeurIPS 2025
The rise of realistic digital face generation and manipulation poses significant social risks. The primary challenge lies in the rapid and diverse evolution of generation techniques, which often outstrip the detection capabilities of existing models. To defend against the ever-evolving new types of forgery, we need to enable our model to quickly adapt to new domains with limited computation and data while avoiding forgetting previously learned forgery types. In this work, we posit that genuine facial samples are abundant and relatively stable in acquisition methods, while forgery faces continuously evolve with the iteration of manipulation techniques. Given the practical infeasibility of exhaustively collecting all forgery variants, we frame face forgery detection as a continual learning problem and allow the model to develop as new forgery types emerge. Specifically, we employ a Developmental Mixture of Experts (MoE) architecture that uses LoRA models as its individual experts. These experts are organized into two groups: a Real-LoRA to learn and refine knowledge of real faces, and multiple Fake-LoRAs to capture incremental information from different forgery types. To prevent catastrophic forgetting, we ensure that the learning direction of Fake-LoRAs is orthogonal to the established subspace. Moreover, we integrate orthogonal gradients into the orthogonal loss of Fake-LoRAs, preventing gradient interference throughout the training process of each task. Experimental results under both the datasets and manipulation types incremental protocols demonstrate the effectiveness of our method.
comment: Accepted by NeurIPS 2025
♻ ☆ HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection
Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies, which are events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD, Hybrid Complex Video Anomaly Detection, a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with LLM validator. The SSL module is built upon an nnFormer backbone which is a transformer-based model for image segmentation. It is trained with multiple proxy tasks, learns from video frames to identify those suspected of anomaly. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.
comment: The submission was made prematurely. The authors plan to resubmit under the supervision of the corresponding author
♻ ☆ A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy
Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.
♻ ☆ HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging
Accurate liver and tumor segmentation on abdominal CT images is critical for reliable diagnosis and treatment planning, but remains challenging due to complex anatomical structures, variability in tumor appearance, and limited annotated data. To address these issues, we introduce Hyperbolic-convolutions Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity Network (HANS-Net), a novel segmentation framework that synergistically combines hyperbolic convolutions for hierarchical geometric representation, a wavelet-inspired decomposition module for multi-scale texture learning, a biologically motivated synaptic plasticity mechanism for adaptive feature enhancement, and an implicit neural representation branch to model fine-grained and continuous anatomical boundaries. Additionally, we incorporate uncertainty-aware Monte Carlo dropout to quantify prediction confidence and lightweight temporal attention to improve inter-slice consistency without sacrificing efficiency. Extensive evaluations of the LiTS dataset demonstrate that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap error (VOE) of 11.91%. Furthermore, cross-dataset validation on the AMOS 2022 dataset obtains an average Dice of 85.09%, IoU of 76.66%, ASSD of 19.49 mm, and VOE of 23.34%, indicating strong generalization across different datasets. These results confirm the effectiveness and robustness of HANS-Net in providing anatomically consistent, accurate, and confident liver and tumor segmentation.
comment: Manuscript under review in IEEE Transactions on Radiation and Plasma Medical Sciences
♻ ☆ Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into multiple MLLM families. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
comment: 20 pages, 6 figures
♻ ☆ On Large Multimodal Models as Open-World Image Classifiers ICCV2025
Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.
comment: Accepted at ICCV2025, 23 pages, 13 figures, code is available at https://github.com/altndrr/lmms-owc
♻ ☆ ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On
Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.
comment: 21 pages
♻ ☆ Mapping Farmed Landscapes from Remote Sensing
Effective management of agricultural landscapes is critical for meeting global biodiversity targets, but efforts are hampered by the absence of detailed, large-scale ecological maps. To address this, we introduce Farmscapes, the first large-scale (covering most of England), high-resolution (25cm) map of rural landscape features, including ecologically vital elements like hedgerows, woodlands, and stone walls. This map was generated using a deep learning segmentation model trained on a novel, dataset of 942 manually annotated tiles derived from aerial imagery. Our model accurately identifies key habitats, achieving high f1-scores for woodland (96\%) and farmed land (95\%), and demonstrates strong capability in segmenting linear features, with an F1-score of 72\% for hedgerows. By releasing the England-wide map on Google Earth Engine, we provide a powerful, open-access tool for ecologists and policymakers. This work enables data-driven planning for habitat restoration, supports the monitoring of initiatives like the EU Biodiversity Strategy, and lays the foundation for advanced analysis of landscape connectivity.
♻ ☆ OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models SIGGRAPH
In Omnimatte, one aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. These are accomplished by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. To overcome this, we introduce temporal and spatial attention guidance modules that steer the diffusion process for accurate object removal and temporally consistent background reconstruction. We further show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.
comment: Accepted to SIGGRAPH ASIA 2025. Project Page: https://dvirsamuel.github.io/omnimattezero.github.io/
♻ ☆ Incomplete Multimodal Industrial Anomaly Detection via Cross-Modal Distillation
Recent studies of multimodal industrial anomaly detection (IAD) based on 3D point clouds and RGB images have highlighted the importance of exploiting the redundancy and complementarity among modalities for accurate classification and segmentation. However, achieving multimodal IAD in practical production lines remains a work in progress. It is essential to consider the trade-offs between the costs and benefits associated with the introduction of new modalities while ensuring compatibility with current processes. Existing quality control processes combine rapid in-line inspections, such as optical and infrared imaging with high-resolution but time-consuming near-line characterization techniques, including industrial CT and electron microscopy to manually or semi-automatically locate and analyze defects in the production of Li-ion batteries and composite materials. Given the cost and time limitations, only a subset of the samples can be inspected by all in-line and near-line methods, and the remaining samples are only evaluated through one or two forms of in-line inspection. To fully exploit data for deep learning-driven automatic defect detection, the models must have the ability to leverage multimodal training and handle incomplete modalities during inference. In this paper, we propose CMDIAD, a Cross-Modal Distillation framework for IAD to demonstrate the feasibility of a Multi-modal Training, Few-modal Inference (MTFI) pipeline. Our findings show that the MTFI pipeline can more effectively utilize incomplete multimodal information compared to applying only a single modality for training and inference. Moreover, we investigate the reasons behind the asymmetric performance improvement using point clouds or RGB images as the main modality of inference. This provides a foundation for our future multimodal dataset construction with additional modalities from manufacturing scenarios.
comment: For a published version refer to the Information Fusion, DOI. 10.1016/j.inffus.2025.103572
♻ ☆ TinyDef-DETR: A Transformer-Based Framework for Defect Detection in Transmission Lines from UAV Imagery
Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult objects. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous objects.
♻ ☆ UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation ICCV 2025
Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
comment: ICCV 2025. Project Page: https://chaitanya100100.github.io/UniEgoMotion/
♻ ☆ ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 440 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.
comment: 58 pages
♻ ☆ InfoDet: A Dataset for Infographic Element Detection ICLR 2026
Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.
comment: Submitted to ICLR 2026
♻ ☆ The Fluorescent Veil: A Stealthy and Effective Physical Adversarial Patch Against Traffic Sign Recognition
Recently, traffic sign recognition (TSR) systems have become a prominent target for physical adversarial attacks. These attacks typically rely on conspicuous stickers and projections, or using invisible light and acoustic signals that can be easily blocked. In this paper, we introduce a novel attack medium, i.e., fluorescent ink, to design a stealthy and effective physical adversarial patch, namely FIPatch, to advance the state-of-the-art. Specifically, we first model the fluorescence effect in the digital domain to identify the optimal attack settings, which guide the real-world fluorescence parameters. By applying a carefully designed fluorescence perturbation to the target sign, the attacker can later trigger a fluorescent effect using invisible ultraviolet light, causing the TSR system to misclassify the sign and potentially leading to traffic accidents. We conducted a comprehensive evaluation to investigate the effectiveness of FIPatch, which shows a success rate of 98.31% in low-light conditions. Furthermore, our attack successfully bypasses five popular defenses and achieves a success rate of 96.72%.
♻ ☆ FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models
Lane segment topology reasoning provides comprehensive bird's-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% v.s. 33.6% on mAP) and centerline perception (46.3% v.s. 41.5% on OLS).
♻ ☆ TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving
Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.
♻ ☆ Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Vision Language Models (VLMs) have lead to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures, that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought-reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
♻ ☆ Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision ICCV 2025
Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://GroundLMM-ICCV.github.io.
comment: ICCV 2025 Findings
♻ ☆ TinyRS-R1: Compact Multimodal Language Model for Remote Sensing
Remote-sensing applications often run on edge hardware that cannot host today's 7B-parameter multimodal language models. This paper introduces TinyRS, the first 2B-parameter multimodal small language model (MSLM) optimized for remote sensing tasks, and TinyRS-R1, its reasoning-augmented variant. Built upon Qwen2-VL-2B, TinyRS is trained through a four-stage pipeline: pre-training on million satellite images, instruction tuning on visual instruction examples, fine-tuning with Chain-of-Thought (CoT) annotations from the proposed reasoning dataset, and alignment via Group Relative Policy Optimization (GRPO). TinyRS-R1 achieves or surpasses the performance of recent 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended question answering-while requiring just one-third of the memory and latency. Our analysis shows that CoT reasoning substantially benefits spatial grounding and scene understanding, while the non-reasoning TinyRS excels in concise, latency-sensitive VQA tasks. TinyRS-R1 represents the first domain-specialized MSLM with GRPO-aligned CoT reasoning for general-purpose remote sensing.
comment: Accepted to IEEE Geoscience and Remote Sensing Letters (GRSL). Code, models, and the captions for datasets are available at https://github.com/aybora/TinyRS
♻ ☆ WoW: Towards a World omniscient World model Through Embodied Interaction
Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
♻ ☆ Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents
Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation.
♻ ☆ One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting NeurIPS 2025
Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
comment: 27 pages, 11 figures, to appear at NeurIPS 2025
♻ ☆ Flows and Diffusions on the Neural Manifold
Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques towards matching a gradient flow, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.
comment: 43 pages, 11 figures, 14 tables
♻ ☆ SOHES: Self-supervised Open-world Hierarchical Entity Segmentation ICLR 2024
Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES-ICLR.github.io.
comment: ICLR 2024
♻ ☆ Falcon: A Remote Sensing Vision-Language Foundation Model (Technical Report)
This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, i.e., image classification, object detection, segmentation, image captioning, and etc. To facilitate Falcon's training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments are conducted, which verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at https://github.com/TianHuiLab/Falcon, hoping to help further develop the open-source community.
♻ ☆ AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization
Text-to-image (T2I) customization empowers users to adapt the T2I diffusion model to new concepts absent in the pre-training dataset. On this basis, capturing multiple new concepts from a single image has emerged as a new task, allowing the model to learn multiple concepts simultaneously or discard unwanted concepts. However, multiple-concept disentanglement remains a key challenge. Existing disentanglement models often exhibit two main issues: feature fusion and asynchronous learning across different concepts. To address these issues, we propose AttenCraft, an attention-based method for multiple-concept disentanglement. Our method uses attention maps to generate accurate masks for each concept in a single initialization step, aiding in concept disentanglement without requiring mask preparation from humans or specialized models. Moreover, we introduce an adaptive algorithm based on attention scores to estimate sampling ratios for different concepts, promoting balanced feature acquisition and synchronized learning. AttenCraft also introduces a feature-retaining training framework that employs various loss functions to enhance feature recognition and prevent fusion. Extensive experiments show that our model effectively mitigates these two issues, achieving state-of-the-art image fidelity and comparable prompt fidelity to baseline models.
♻ ☆ A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation
Objective Renal cancer is a common malignancy and a major cause of cancer-related deaths. Computed tomography (CT) is central to early detection, staging, and treatment planning. However, the growing CT workload increases radiologists' burden and risks incomplete documentation. Automatically generating accurate reports remains challenging because it requires integrating visual interpretation with clinical reasoning. Advances in artificial intelligence (AI), especially large language and vision-language models, offer potential to reduce workload and enhance diagnostic quality. Methods We propose a clinically informed, two-stage framework for automatic renal CT report generation. In Stage 1, a multi-task learning model detects structured clinical features from each 2D image. In Stage 2, a vision-language model generates free-text reports conditioned on the image and the detected features. To evaluate clinical fidelity, generated clinical features are extracted from the reports and compared with expert-annotated ground truth. Results Experiments on an expert-labeled dataset show that incorporating detected features improves both report quality and clinical accuracy. The model achieved an average AUC of 0.75 for key imaging features and a METEOR score of 0.33, demonstrating higher clinical consistency and fewer template-driven errors. Conclusion Linking structured feature detection with conditioned report generation provides a clinically grounded approach to integrate structured prediction and narrative drafting for renal CT reporting. This method enhances interpretability and clinical faithfulness, underscoring the value of domain-relevant evaluation metrics for medical AI development.
♻ ☆ Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.
♻ ☆ On Equivariance and Fast Sampling in Video Diffusion Models Trained with Warped Noise
Temporally consistent video-to-video generation is critical for applications such as style transfer and upsampling. In this paper, we provide a theoretical analysis of warped noise - a recently proposed technique for training video diffusion models - and show that pairing it with the standard denoising objective implicitly trains models to be equivariant to spatial transformations of the input noise, which we term EquiVDM. This equivariance enables motion in the input noise to align naturally with motion in the generated video, yielding coherent, high-fidelity outputs without the need for specialized modules or auxiliary losses. A further advantage is sampling efficiency: EquiVDM achieves comparable or superior quality in far fewer sampling steps. When distilled into one-step student models, EquiVDM preserves equivariance and delivers stronger motion controllability and fidelity than distilled nonequivariant baselines. Across benchmarks, EquiVDM consistently outperforms prior methods in motion alignment, temporal consistency, and perceptual quality, while substantially lowering sampling cost.
♻ ☆ ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation SIGGRAPH
Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.
comment: Accepted to SIGGRAPH Asia 2025, webpage: https://snap-research.github.io/composeme/
♻ ☆ MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition
Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a design of motion-awareness, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.
♻ ☆ Impact of Regularization on Calibration and Robustness: from the Representation Space Perspective
Recent studies have shown that regularization techniques using soft labels, e.g., label smoothing, Mixup, and CutMix, not only enhance image classification accuracy but also mitigate miscalibration due to overconfident predictions, and improve robustness against adversarial attacks. However, the underlying mechanisms of such improvements remain underexplored. In this paper, we offer a novel explanation from the perspective of the representation space (i.e., the space of the features obtained at the penultimate layer). Based on examination of decision boundaries and structure of features (or representation vectors), our study investigates confidence contours and gradient directions within the representation space. Furthermore, we analyze the adjustments in feature distributions due to regularization in relation to these contours and directions, from which we uncover central mechanisms inducing improved calibration and robustness. Our findings provide new insights into the characteristics of the high-dimensional representation space in relation to training and regularization using soft labels.
♻ ☆ Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The code, model weights, and dataset can be found here: https://github.com/opendatalab/FakeVLM.
♻ ☆ ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization NeurIPS 2025
The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL remains blank. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations on dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts indepth analysis based on the ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.
comment: NeurIPS 2025 DB Track Paper. Code available at: https://github.com/scu-zjz/ForensicHub
♻ ☆ SphereDrag: Spherical Geometry-Aware Panoramic Image Editing
Image editing has made great progress on planar images, but panoramic image editing remains underexplored. Due to their spherical geometry and projection distortions, panoramic images present three key challenges: boundary discontinuity, trajectory deformation, and uneven pixel density. To tackle these issues, we propose SphereDrag, a novel panoramic editing framework utilizing spherical geometry knowledge for accurate and controllable editing. Specifically, adaptive reprojection (AR) uses adaptive spherical rotation to deal with discontinuity; great-circle trajectory adjustment (GCTA) tracks the movement trajectory more accurate; spherical search region tracking (SSRT) adaptively scales the search range based on spherical location to address uneven pixel density. Also, we construct PanoBench, a panoramic editing benchmark, including complex editing tasks involving multiple objects and diverse styles, which provides a standardized evaluation framework. Experiments show that SphereDrag gains a considerable improvement compared with existing methods in geometric consistency and image quality, achieving up to 10.5% relative improvement.
comment: Accepted by PRCV 2025
♻ ☆ ELASTIC: Efficient Once For All Iterative Search for Object Detection on Microcontrollers
Deploying high-performance object detectors on TinyML platforms poses significant challenges due to tight hardware constraints and the modular complexity of modern detection pipelines. Neural Architecture Search (NAS) offers a path toward automation, but existing methods either restrict optimization to individual modules, sacrificing cross-module synergy, or require global searches that are computationally intractable. We propose ELASTIC (Efficient Once for AlL IterAtive Search for ObjecT DetectIon on MiCrocontrollers), a unified, hardware-aware NAS framework that alternates optimization across modules (e.g., backbone, neck, and head) in a cyclic fashion. ELASTIC introduces a novel Population Passthrough mechanism in evolutionary search that retains high-quality candidates between search stages, yielding faster convergence, up to an 8% final mAP gain, and eliminates search instability observed without population passthrough. In a controlled comparison, empirical results show ELASTIC achieves +4.75% higher mAP and 2x faster convergence than progressive NAS strategies on SVHN, and delivers a +9.09% mAP improvement on PascalVOC given the same search budget. ELASTIC achieves 72.3% mAP on PascalVOC, outperforming MCUNET by 20.9% and TinyissimoYOLO by 16.3%. When deployed on MAX78000/MAX78002 microcontrollers, ELASTICderived models outperform Analog Devices' TinySSD baselines, reducing energy by up to 71.6%, lowering latency by up to 2.4x, and improving mAP by up to 6.99 percentage points across multiple datasets.
comment: 8 pages, 7 figures
♻ ☆ CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding. Our project page is https://sensetime-fvg.github.io/CVD-STORM.
♻ ☆ RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
♻ ☆ From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning ICCV 2025
Multi-image Interleaved Reasoning aims to improve Multi-modal Large Language Models (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an "easy to hard" approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs capability to handle complex inter-modal tasks.
comment: Accepted by ICCV 2025
♻ ☆ HuGDiffusion: Generalizable Single-Image Human Rendering via 3D Gaussian Diffusion
We present HuGDiffusion, a generalizable 3D Gaussian splatting (3DGS) learning pipeline to achieve novel view synthesis (NVS) of human characters from single-view input images. Existing approaches typically require monocular videos or calibrated multi-view images as inputs, whose applicability could be weakened in real-world scenarios with arbitrary and/or unknown camera poses. In this paper, we aim to generate the set of 3DGS attributes via a diffusion-based framework conditioned on human priors extracted from a single image. Specifically, we begin with carefully integrated human-centric feature extraction procedures to deduce informative conditioning signals. Based on our empirical observations that jointly learning the whole 3DGS attributes is challenging to optimize, we design a multi-stage generation strategy to obtain different types of 3DGS attributes. To facilitate the training process, we investigate constructing proxy ground-truth 3D Gaussian attributes as high-quality attribute-level supervision signals. Through extensive experiments, our HuGDiffusion shows significant performance improvements over the state-of-the-art methods. Our code will be made publicly available.
♻ ☆ CAP: Evaluation of Persuasive and Creative Image Generation
We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) generation and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. Our findings reveal that current T2I models struggle with creativity, persuasiveness, and alignment when the input text is implicit messages. We further introduce a simple yet effective approach to enhance T2I models' capabilities in producing images that are better aligned, more creative, and more persuasive.
♻ ☆ TRACE: Your Diffusion Model is Secretly an Instance Edge Detector
High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81x faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Code is available at https://github.com/shjo-april/DiffEGG.
♻ ☆ EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images-resulting in limited coverage and inheriting biases from prior generative models-or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
comment: Tianyu Chen and Yasi Zhang contributed equally; Oscar Leong, Lijuan Wang, Ying Nian Wu, and Mingyuan Zhou advised equally
♻ ☆ UrbanTwin: Synthetic LiDAR Datasets (LUMPI, V2X-Real-IC, and TUMTraf-I)
This article presents UrbanTwin datasets, high-fidelity, realistic replicas of three public roadside lidar datasets: LUMPI, V2X-Real-IC}}, and TUMTraf-I. Each UrbanTwin dataset contains 10K annotated frames corresponding to one of the public datasets. Annotations include 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, along with semantic segmentation labels for nine classes. These datasets are synthesized using emulated lidar sensors within realistic digital twins, modeled based on surrounding geometry, road alignment at lane level, and the lane topology and vehicle movement patterns at intersections of the actual locations corresponding to each real dataset. Due to the precise digital twin modeling, the synthetic datasets are well aligned with their real counterparts, offering strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. We evaluate the alignment of the synthetic replicas through statistical and structural similarity analysis with real data, and further demonstrate their utility by training 3D object detection models solely on synthetic data and testing them on real, unseen data. The high similarity scores and improved detection performance, compared to the models trained on real data, indicate that the UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity. In addition, the digital twins can be adapted to test custom scenarios by modifying the design and dynamics of the simulations. To our knowledge, these are the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks. UrbanTwin datasets are publicly available at https://dataverse.harvard.edu/dataverse/ucf-ut.
♻ ☆ Dataset Distillation for Super-Resolution without Class Labels and Pre-trained Models
Training deep neural networks has become increasingly demanding, requiring large datasets and significant computational resources, especially as model complexity advances. Data distillation methods, which aim to improve data efficiency, have emerged as promising solutions to this challenge. In the field of single image super-resolution (SISR), the reliance on large training datasets highlights the importance of these techniques. Recently, a generative adversarial network (GAN) inversion-based data distillation framework for SR was proposed, showing potential for better data utilization. However, the current method depends heavily on pre-trained SR networks and class-specific information, limiting its generalizability and applicability. To address these issues, we introduce a new data distillation approach for image SR that does not need class labels or pre-trained SR models. In particular, we first extract high-gradient patches and categorize images based on CLIP features, then fine-tune a diffusion model on the selected patches to learn their distribution and synthesize distilled training images. Experimental results show that our method achieves state-of-the-art performance while using significantly less training data and requiring less computational time. Specifically, when we train a baseline Transformer model for SR with only 0.68\% of the original dataset, the performance drop is just 0.3 dB. In this case, diffusion model fine-tuning takes 4 hours, and SR model training completes within 1 hour, much shorter than the 11-hour training time with the full dataset.
comment: code : https://github.com/sunwoocho/SRDD
♻ ☆ SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow NeurIPS
Prompt-based models have demonstrated impressive prompt-following capability at image editing tasks. However, the models still struggle with following detailed editing prompts or performing local edits. Specifically, global image quality often deteriorates immediately after a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality during more than 100 editing steps, while keeping the unedited regions intact. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.
comment: The paper has been accepted to NeurIPS Creative AI Track 2025. Figure 4(c) has been accepted to CVPR AI Art Gallery 2025
♻ ☆ Bolt3D: Generating 3D Scenes in Seconds ICCV 2025
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
comment: ICCV 2025. Project page: https://szymanowiczs.github.io/bolt3d
♻ ☆ Universal Vessel Segmentation for Multi-Modality Retinal Images
We identify two major limitations in the existing studies on retinal vessel segmentation: (1) Most existing works are restricted to one modality, i.e., the Color Fundus (CF). However, multi-modality retinal images are used every day in the study of the retina and diagnosis of retinal diseases, and the study of vessel segmentation on other modalities is scarce; (2) Even though a few works extended their experiments to new modalities such as the Multi-Color Scanning Laser Ophthalmoscopy (MC), these works still require fine-tuning a separate model for the new modality. The fine-tuning will require extra training data, which is difficult to acquire. In this work, we present a novel universal vessel segmentation model (URVSM) for multi-modality retinal images. In addition to performing the study on a much wider range of image modalities, we also propose a universal model to segment the vessels in all these commonly used modalities. While being much more versatile compared with existing methods, our universal model also demonstrates comparable performance to the state-of-the-art fine-tuned methods. To the best of our knowledge, this is the first work that achieves modality-agnostic retinal vessel segmentation and the first to study retinal vessel segmentation in several novel modalities.
♻ ☆ MotionScript: Natural Language Descriptions for Expressive 3D Human Motions
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement including expressive actions (e.g., emotions, stylistic walking) and interactions beyond standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data.
comment: Project webpage: https://pjyazdian.github.io/MotionScript
♻ ☆ Optimally Deep Networks - Adapting Model Depth to Datasets for Superior Efficiency
Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.
comment: 6 pages, 3 figures, 1 table
♻ ☆ MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation
Clouds remain a critical challenge in optical satellite imagery, hindering reliable analysis for environmental monitoring, land cover mapping, and climate research. To overcome this, we propose MSCloudCAM, a Cross-Attention with Multi-Scale Context Network tailored for multispectral and multi-sensor cloud segmentation. Our framework exploits the spectral richness of Sentinel-2 (CloudSEN12) and Landsat-8 (L8Biome) data to classify four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow. MSCloudCAM combines a Swin Transformer backbone for hierarchical feature extraction with multi-scale context modules ASPP and PSP for enhanced scale-aware learning. A Cross-Attention block enables effective multisensor and multispectral feature fusion, while the integration of an Efficient Channel Attention Block (ECAB) and a Spatial Attention Module adaptively refine feature representations. Comprehensive experiments on CloudSEN12 and L8Biome demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy, surpassing leading baseline architectures while maintaining competitive parameter efficiency and FLOPs. These results underscore the model's effectiveness and practicality, making it well-suited for large-scale Earth observation tasks and real-world applications.
comment: 7 pages, 2 Figures
♻ ☆ Multi-identity Human Image Animation with Structural Video Diffusion ICCV 2025
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and model the distribution of 3D-aware dynamics. To address these limitations, we present \emph{Structural Video Diffusion}, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation. Code is available at https://github.com/zhenzhiwang/Multi-HumanVid
comment: ICCV 2025 camera ready
♻ ☆ ROI-GS: Interest-based Local Quality 3D Gaussian Splatting
We tackle the challenge of efficiently reconstructing 3D scenes with high detail on objects of interest. Existing 3D Gaussian Splatting (3DGS) methods allocate resources uniformly across the scene, limiting fine detail to Regions Of Interest (ROIs) and leading to inflated model size. We propose ROI-GS, an object-aware framework that enhances local details through object-guided camera selection, targeted Object training, and seamless integration of high-fidelity object of interest reconstructions into the global scene. Our method prioritizes higher resolution details on chosen objects while maintaining real-time performance. Experiments show that ROI-GS significantly improves local quality (up to 2.96 dB PSNR), while reducing overall model size by $\approx 17\%$ of baseline and achieving faster training for a scene with a single object of interest, outperforming existing methods.
comment: 4 pages, 3 figures, 3 tables
♻ ☆ TAS: A Transit-Aware Strategy for Embodied Navigation with Non-Stationary Targets
Embodied navigation methods commonly operate in static environments with stationary targets. In this work, we present a new algorithm for navigation in dynamic scenarios with non-stationary targets. Our novel Transit-Aware Strategy (TAS) enriches embodied navigation policies with object path information. TAS improves performance in non-stationary environments by rewarding agents for synchronizing their routes with target routes. To evaluate TAS, we further introduce Dynamic Object Maps (DOMs), a dynamic variant of node-attributed topological graphs with structured object transitions. DOMs are inspired by human habits to simulate realistic object routes on a graph. Our experiments show that on average, TAS improves agent Success Rate (SR) by 21.1 in non-stationary environments, while also generalizing better from static environments by 44.5% when measured by Relative Change in Success (RCS). We qualitatively investigate TAS-agent performance on DOMs and draw various inferences to help better model generalist navigation policies. To the best of our knowledge, ours is the first work that quantifies the adaptability of embodied navigation methods in non-stationary environments. Code and data for our benchmark will be made publicly available.
comment: 15 pages
♻ ☆ Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic B\'ezier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify B\'ezier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
comment: 20 pages, 4 figures
♻ ☆ Euclid Quick Data Release (Q1). Active galactic nuclei identification using diffusion-based inpainting of Euclid VIS images
Light emission from galaxies exhibit diverse brightness profiles, influenced by factors such as galaxy type, structural features and interactions with other galaxies. Elliptical galaxies feature more uniform light distributions, while spiral and irregular galaxies have complex, varied light profiles due to their structural heterogeneity and star-forming activity. In addition, galaxies with an active galactic nucleus (AGN) feature intense, concentrated emission from gas accretion around supermassive black holes, superimposed on regular galactic light, while quasi-stellar objects (QSO) are the extreme case of the AGN emission dominating the galaxy. The challenge of identifying AGN and QSO has been discussed many times in the literature, often requiring multi-wavelength observations. This paper introduces a novel approach to identify AGN and QSO from a single image. Diffusion models have been recently developed in the machine-learning literature to generate realistic-looking images of everyday objects. Utilising the spatial resolving power of the Euclid VIS images, we created a diffusion model trained on one million sources, without using any source pre-selection or labels. The model learns to reconstruct light distributions of normal galaxies, since the population is dominated by them. We condition the prediction of the central light distribution by masking the central few pixels of each source and reconstruct the light according to the diffusion model. We further use this prediction to identify sources that deviate from this profile by examining the reconstruction error of the few central pixels regenerated in each source's core. Our approach, solely using VIS imaging, features high completeness compared to traditional methods of AGN and QSO selection, including optical, near-infrared, mid-infrared, and X-rays.
comment: Paper Accepted as part of the A&A Special Issue `Euclid Quick Data Release (Q1)', 34 pages, 26 figures
♻ ☆ On the Interaction of Compressibility and Adversarial Robustness
Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
♻ ☆ ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation
Recent advances in large generative models have greatly enhanced both image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Project page for code and models: https://research.nvidia.com/labs/toronto-ai/chronoedit
comment: Project Page: https://research.nvidia.com/labs/toronto-ai/chronoedit
Machine Learning 284
☆ Agentic Design of Compositional Machines
The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
comment: 75 pages, 31 figures, Project Page: https://besiegefield.github.io
☆ Learning an Image Editing Model without Image Editing Pairs
Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
comment: project page: https://nupurkmr9.github.io/npedit/
☆ Terra: Explorable Native 3D World Model with Point Latents
World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
comment: Project Page: https://huang-yh.github.io/terra/
☆ pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
comment: Code: https://github.com/Lakonik/piFlow Demos: https://huggingface.co/spaces/Lakonik/pi-Qwen and https://huggingface.co/spaces/Lakonik/pi-FLUX.1
Attention Is All You Need for KV Cache in Diffusion LLMs
This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
comment: https://vila-lab.github.io/elastic-cache-webpage/
☆ TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
☆ LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in both human annotation, infra and engineering perspectives. To this end, we introduce $\textbf{UI-Simulator}$, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose $\textbf{UI-Simulator-Grow}$, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizes informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of targeted synthesis scaling paradigm to continuously and efficiently enhance the digital agents.
comment: Preprint. Project page: https://ui-simulator.notion.site/llms-as-scalable-digital-world-simulator; Code and data: https://github.com/WadeYin9712/UI-Simulator
☆ Biology-informed neural networks learn nonlinear representations from omics data to improve genomic prediction and interpretability
We extend biologically-informed neural networks (BINNs) for genomic prediction (GP) and selection (GS) in crops by integrating thousands of single-nucleotide polymorphisms (SNPs) with multi-omics measurements and prior biological knowledge. Traditional genotype-to-phenotype (G2P) models depend heavily on direct mappings that achieve only modest accuracy, forcing breeders to conduct large, costly field trials to maintain or marginally improve genetic gain. Models that incorporate intermediate molecular phenotypes such as gene expression can achieve higher predictive fit, but they remain impractical for GS since such data are unavailable at deployment or design time. BINNs overcome this limitation by encoding pathway-level inductive biases and leveraging multi-omics data only during training, while using genotype data alone during inference. Applied to maize gene-expression and multi-environment field-trial data, BINN improves rank-correlation accuracy by up to 56% within and across subpopulations under sparse-data conditions and nonlinearly identifies genes that GWAS/TWAS fail to uncover. With complete domain knowledge for a synthetic metabolomics benchmark, BINN reduces prediction error by 75% relative to conventional neural nets and correctly identifies the most important nonlinear pathway. Importantly, both cases show highly sensitive BINN latent variables correlate with the experimental quantities they represent, despite not being trained on them. This suggests BINNs learn biologically-relevant representations, nonlinear or linear, from genotype to phenotype. Together, BINNs establish a framework that leverages intermediate domain information to improve genomic prediction accuracy and reveal nonlinear biological relationships that can guide genomic selection, candidate gene selection, pathway enrichment, and gene-editing prioritization.
comment: 35 pages, 12 figures
☆ RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks NeurIPS 2025
To tackle long-horizon tasks, recent hierarchical vision-language-action (VLAs) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic subtasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025); Project Website: rdd-neurips.github.io
☆ Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
☆ Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores
Pairwise comparisons of large language models using total variation distance mutual information (TVD-MI) produce binary critic decisions per pair. We show that averaging TVD-MI's binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. Maximum-likelihood approaches to IRT use logistic links, but we find empirically that these transformations introduce curvature that breaks additivity: across three domains, the identity link yields median curl on raw data of 0.080-0.150 (P95 = [0.474, 0.580]), whereas probit/logit introduce substantially higher violations (median [0.245, 0.588], P95 [0.825, 2.252]). We derive this clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation. At 33% coverage, we achieve holdout RMSE $0.117 \pm 0.008$ while preserving agent rankings (Spearman $\rho = 0.972 \pm 0.015$), three times fewer evaluations than full dense. Judge robustness analysis (GPT-4o-mini vs. Llama3-70b) shows strong agreement in agent rankings ($\rho = 0.872$) and consistent identity-link advantage. TVD-MI's geometry is best preserved by identity mapping for efficient LLM evaluation, applicable to other bounded-response domains.
comment: 9 pages, 2 figures
Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
comment: Code can be found at https://github.com/seal-rg/recurrent-pretraining
☆ CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed \emph{online} via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs \emph{in training}. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, (2) and safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: 8 pages
☆ DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
☆ Architecture Is All You Need: Diversity-Enabled Sweet Spots for Robust Humanoid Locomotion
Robust humanoid locomotion in unstructured environments requires architectures that balance fast low-level stabilization with slower perceptual decision-making. We show that a simple layered control architecture (LCA), a proprioceptive stabilizer running at high rate, coupled with a compact low-rate perceptual policy, enables substantially more robust performance than monolithic end-to-end designs, even when using minimal perception encoders. Through a two-stage training curriculum (blind stabilizer pretraining followed by perceptual fine-tuning), we demonstrate that layered policies consistently outperform one-stage alternatives in both simulation and hardware. On a Unitree G1 humanoid, our approach succeeds across stair and ledge tasks where one-stage perceptual policies fail. These results highlight that architectural separation of timescales, rather than network scale or complexity, is the key enabler for robust perception-conditioned locomotion.
comment: 8 pages
☆ LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
comment: Work in progress. Github repo: https://github.com/RUCBM/LaSeR
☆ Circuit Insights: Towards Interpretability Beyond Activations
The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.
☆ VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tunin
Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at https://binghao-huang.github.io/vt_refine/.
comment: Accepted by 9th Conference on Robot Learning (CoRL 2025); Website: https://binghao-huang.github.io/vt_refine/
☆ Instruction Set Migration at Warehouse Scale
Migrating codebases from one instruction set architecture (ISA) to another is a major engineering challenge. A recent example is the adoption of Arm (in addition to x86) across the major Cloud hyperscalers. Yet, this problem has seen limited attention by the academic community. Most work has focused on static and dynamic binary translation, and the traditional conventional wisdom has been that this is the primary challenge. In this paper, we show that this is no longer the case. Modern ISA migrations can often build on a robust open-source ecosystem, making it possible to recompile all relevant software from scratch. This introduces a new and multifaceted set of challenges, which are different from binary translation. By analyzing a large-scale migration from x86 to Arm at Google, spanning almost 40,000 code commits, we derive a taxonomy of tasks involved in ISA migration. We show how Google automated many of the steps involved, and demonstrate how AI can play a major role in automatically addressing these tasks. We identify tasks that remain challenging and highlight research challenges that warrant further attention.
☆ Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing -- and selectively reducing -- overconfidence in reasoning systems. This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.
comment: 19 pages, 2 figures, preliminary version
☆ TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG
Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.
☆ Predicting Task Performance with Context-aware Scaling Laws
Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.
☆ Budget-aware Test-time Scaling via Discriminative Verification
Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3\% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a "free" upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.
☆ Learnable Mixed Nash Equilibria are Collectively Rational
We extend the study of learning in games to dynamics that exhibit non-asymptotic stability. We do so through the notion of uniform stability, which is concerned with equilibria of individually utility-seeking dynamics. Perhaps surprisingly, it turns out to be closely connected to economic properties of collective rationality. Under mild non-degeneracy conditions and up to strategic equivalence, if a mixed equilibrium is not uniformly stable, then it is not weakly Pareto optimal: there is a way for all players to improve by jointly deviating from the equilibrium. On the other hand, if it is locally uniformly stable, then the equilibrium must be weakly Pareto optimal. Moreover, we show that uniform stability determines the last-iterate convergence behavior for the family of incremental smoothed best-response dynamics, used to model individual and corporate behaviors in the markets. Unlike dynamics around strict equilibria, which can stabilize to socially-inefficient solutions, individually utility-seeking behaviors near mixed Nash equilibria lead to collective rationality.
☆ MaskCaptioner : Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
☆ Reasoning with Sampling: Your Base Model is Smarter Than You Think
Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilites can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
☆ Secure Sparse Matrix Multiplications and their Applications to Privacy-Preserving Machine Learning
To preserve privacy, multi-party computation (MPC) enables executing Machine Learning (ML) algorithms on secret-shared or encrypted data. However, existing MPC frameworks are not optimized for sparse data. This makes them unsuitable for ML applications involving sparse data, e.g., recommender systems or genomics. Even in plaintext, such applications involve high-dimensional sparse data, that cannot be processed without sparsity-related optimizations due to prohibitively large memory requirements. Since matrix multiplication is central in ML algorithms, we propose MPC algorithms to multiply secret sparse matrices. On the one hand, our algorithms avoid the memory issues of the "dense" data representation of classic secure matrix multiplication algorithms. On the other hand, our algorithms can significantly reduce communication costs (some experiments show a factor 1000) for realistic problem sizes. We validate our algorithms in two ML applications in which existing protocols are impractical. An important question when developing MPC algorithms is what assumptions can be made. In our case, if the number of non-zeros in a row is a sensitive piece of information then a short runtime may reveal that the number of non-zeros is small. Existing approaches make relatively simple assumptions, e.g., that there is a universal upper bound to the number of non-zeros in a row. This often doesn't align with statistical reality, in a lot of sparse datasets the amount of data per instance satisfies a power law. We propose an approach which allows adopting a safe upper bound on the distribution of non-zeros in rows/columns of sparse matrices.
☆ Prediction-Specific Design of Learning-Augmented Algorithms
Algorithms with predictions} has emerged as a powerful framework to combine the robustness of traditional online algorithms with the data-driven performance benefits of machine-learned (ML) predictions. However, most existing approaches in this paradigm are overly conservative, {as they do not leverage problem structure to optimize performance in a prediction-specific manner}. In this paper, we show that such prediction-specific performance criteria can enable significant performance improvements over the coarser notions of consistency and robustness considered in prior work. Specifically, we propose a notion of \emph{strongly-optimal} algorithms with predictions, which obtain Pareto optimality not just in the worst-case tradeoff between robustness and consistency, but also in the prediction-specific tradeoff between these metrics. We develop a general bi-level optimization framework that enables systematically designing strongly-optimal algorithms in a wide variety of problem settings, and we propose explicit strongly-optimal algorithms for several classic online problems: deterministic and randomized ski rental, and one-max search. Our analysis reveals new structural insights into how predictions can be optimally integrated into online algorithms by leveraging a prediction-specific design. To validate the benefits of our proposed framework, we empirically evaluate our algorithms in case studies on problems including dynamic power management and volatility-based index trading. Our results demonstrate that prediction-specific, strongly-optimal algorithms can significantly improve performance across a variety of online decision-making settings.
☆ Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards
In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all of sequential decision-making theory assumes that all errors are recoverable (e.g., by bounding rewards). Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails. Some prior work avoids irreparable errors by asking for help from a mentor, but a mentor may not always be available. In this work, we formalize a model of learning with unbounded rewards without a mentor as a two-action contextual bandit with an abstain option: at each round the agent observes an input and chooses either to abstain (always 0 reward) or to commit (execute a preexisting task policy). Committing yields rewards that are upper-bounded but can be arbitrarily negative, and the commit reward is assumed Lipschitz in the input. We propose a caution-based algorithm that learns when not to learn: it chooses a trusted region and commits only where the available evidence does not already certify harm. Under these conditions and i.i.d. inputs, we establish sublinear regret guarantees, theoretically demonstrating the effectiveness of cautious exploration for deploying learning agents safely in high-stakes environments.
comment: 16 pages, 1 figure; under submission
☆ Predicting kernel regression learning curves from only raw data statistics
We study kernel regression with common rotation-invariant kernels on real datasets including CIFAR-5m, SVHN, and ImageNet. We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function $f_*$. The key new idea is an analytical approximation of a kernel's eigenvalues and eigenfunctions with respect to an anisotropic data distribution. The eigenfunctions resemble Hermite polynomials of the data, so we call this approximation the Hermite eigenstructure ansatz (HEA). We prove the HEA for Gaussian data, but we find that real image data is often "Gaussian enough" for the HEA to hold well in practice, enabling us to predict learning curves by applying prior results relating kernel eigenstructure to test risk. Extending beyond kernel regression, we empirically find that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by the HEA. Our HEA framework is a proof of concept that an end-to-end theory of learning which maps dataset structure all the way to model performance is possible for nontrivial learning algorithms on real datasets.
☆ From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR
General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.
☆ A Multi-Task Deep Learning Framework for Skin Lesion Classification, ABCDE Feature Quantification, and Evolution Simulation
Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and also quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening more windows for future exploration. The A, B, C, and D values are quantified particularly within this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevuses to malignant melanoma. The experiments are conducted using the HAM10000 dataset that contains around ten thousand images of skin lesions of varying stages. In summary, the classification worked with an accuracy of around 89 percent, with melanoma AUC being 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.
☆ A Geometric Approach to Optimal Experimental Design
We introduce a novel geometric framework for optimal experimental design (OED). Traditional OED approaches, such as those based on mutual information, rely explicitly on probability densities, leading to restrictive invariance properties. To address these limitations, we propose the mutual transport dependence (MTD), a measure of statistical dependence grounded in optimal transport theory which provides a geometric objective for optimizing designs. Unlike conventional approaches, the MTD can be tailored to specific downstream estimation problems by choosing appropriate geometries on the underlying spaces. We demonstrate that our framework produces high-quality designs while offering a flexible alternative to standard information-theoretic techniques.
☆ Backdoor Unlearning by Linear Task Decomposition
Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
☆ Provable Unlearning with Gradient Ascent on Two-Layer ReLU Neural Networks
Machine Unlearning aims to remove specific data from trained models, addressing growing privacy and ethical concerns. We provide a theoretical analysis of a simple and widely used method - gradient ascent - used to reverse the influence of a specific data point without retraining from scratch. Leveraging the implicit bias of gradient descent towards solutions that satisfy the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem, we quantify the quality of the unlearned model by evaluating how well it satisfies these conditions w.r.t. the retained data. To formalize this idea, we propose a new success criterion, termed \textbf{$(\epsilon, \delta, \tau)$-successful} unlearning, and show that, for both linear models and two-layer neural networks with high dimensional data, a properly scaled gradient-ascent step satisfies this criterion and yields a model that closely approximates the retrained solution on the retained data. We also show that gradient ascent performs successful unlearning while still preserving generalization in a synthetic Gaussian-mixture setting.
☆ Reinforcement Learning with Stochastic Reward Machines AAAI
Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.
comment: A shorter version of this paper appeared in the Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22). Source code available at https://github.com/corazza/srm
☆ Intelligent Dynamic Handover via AI-assisted Signal Quality Prediction in 6G Multi-RAT Networks
The emerging paradigm of 6G multiple Radio Access Technology (multi-RAT) networks, where cellular and Wireless Fidelity (WiFi) transmitters coexist, requires mobility decisions that remain reliable under fast channel dynamics, interference, and heterogeneous coverage. Handover in multi-RAT deployments is still highly reactive and event-triggered, relying on instantaneous measurements and threshold events. This work proposes a Machine Learning (ML)-assisted Predictive Conditional Handover (P-CHO) framework based on a model-driven and short-horizon signal quality forecasts. We present a generalized P-CHO sequence workflow orchestrated by a RAT Steering Controller, which standardizes data collection, parallel per-RAT predictions, decision logic with hysteresis-based conditions, and CHO execution. Considering a realistic multi-RAT environment, we train RAT-aware Long Short Term Memory (LSTM) networks to forecast the signal quality indicators of mobile users along randomized trajectories. The proposed P-CHO models are trained and evaluated under different channel models for cellular and IEEE 802.11 WiFi integrated coverage. We study the impact of hyperparameter tuning of LSTM models under different system settings, and compare direct multi-step versus recursive P-CHO variants. Comparisons against baseline predictors are also carried out. Finally, the proposed P-CHO is tested under soft and hard handover settings, showing that hysteresis-enabled P-CHO scheme is able to reduce handover failures and ping-pong events. Overall, the proposed P-CHO framework can enable accurate, low-latency, and proactive handovers suitable for ML-assisted handover steering in 6G multi-RAT deployments.
comment: 9 pages, 17 figures
☆ RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained bu supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100\% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.
comment: https://lei-kun.github.io/RL-100/
☆ To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models
State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any ``truly long-form'' generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.
☆ Programmatic Representation Learning with Language Models
Classical models for supervised machine learning, such as decision trees, are efficient and interpretable predictors, but their quality is highly dependent on the particular choice of input features. Although neural networks can learn useful representations directly from raw data (e.g., images or text), this comes at the expense of interpretability and the need for specialized hardware to run them efficiently. In this paper, we explore a hypothesis class we call Learned Programmatic Representations (LeaPR) models, which stack arbitrary features represented as code (functions from data points to scalars) and decision tree predictors. We synthesize feature functions using Large Language Models (LLMs), which have rich prior knowledge in a wide range of domains and a remarkable ability to write code using existing domain-specific libraries. We propose two algorithms to learn LeaPR models from supervised data. First, we design an adaptation of FunSearch to learn features rather than directly generate predictors. Then, we develop a novel variant of the classical ID3 algorithm for decision tree learning, where new features are generated on demand when splitting leaf nodes. In experiments from chess position evaluation to image and text classification, our methods learn high-quality, neural network-free predictors often competitive with neural networks. Our work suggests a flexible paradigm for learning interpretable representations end-to-end where features and predictions can be readily inspected and understood.
comment: Code available at https://github.com/gpoesia/leapr/
☆ Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning
Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment \textbf{P}erception and explicit \textbf{R}oute choice modeling for effective \textbf{Traj}ectory representation learning, dubbed \textbf{PRTraj}. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.
☆ Tackling Time-Series Forecasting Generalization via Mitigating Concept Drift
Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is important for time-series forecasting models to handle potential distribution shifts over time. In this paper, we initially identify two types of distribution shifts in time series: concept drift and temporal shift. We acknowledge that while existing studies primarily focus on addressing temporal shift issues in time series forecasting, designing proper concept drift methods for time series forecasting has received comparatively less attention. Motivated by the need to address potential concept drift, while conventional concept drift methods via invariant learning face certain challenges in time-series forecasting, we propose a soft attention mechanism that finds invariant patterns from both lookback and horizon time series. Additionally, we emphasize the critical importance of mitigating temporal shifts as a preliminary to addressing concept drift. In this context, we introduce ShifTS, a method-agnostic framework designed to tackle temporal shift first and then concept drift within a unified approach. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of agnostic models across multiple datasets, and outperforming existing concept drift, temporal shift, and combined baselines.
comment: 17 pages, 6 figures, 4 tables
Efficient Dynamic Structured Sparse Training with Learned Shuffles
Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures -- block, N:M, and diagonals -- we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90--95\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.
☆ Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning
Hebbian learning is a biological principle that intuitively describes how neurons adapt their connections through repeated stimuli. However, when applied to machine learning, it suffers serious issues due to the unconstrained updates of the connections and the lack of accounting for feedback mediation. Such shortcomings limit its effective scaling to complex network architectures and tasks. To this end, here we introduce the Structural Projection Hebbian Representation (SPHeRe), a novel unsupervised learning method that integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural information preservation backpropagates to the input through an auxiliary lightweight projection that conceptually serves as feedback mediation while the orthogonality constraints account for the boundedness of updating magnitude. Extensive experimental results show that SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on standard image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. Furthermore, the method exhibits strong effectiveness in continual learning and transfer learning scenarios, and image reconstruction tasks show the robustness and generalizability of the extracted features. This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules within modern deep learning frameworks, demonstrating the possibility of efficient and biologically inspired learning algorithms without the strong dependence on strict backpropagation. Our code is available at https://github.com/brain-intelligence-lab/SPHeRe.
☆ Agentic NL2SQL to Reduce Computational Costs NeurIPS 2025
Translating natural language queries into SQL queries (NL2SQL or Text-to-SQL) has recently been empowered by large language models (LLMs). Using LLMs to perform NL2SQL methods on a large collection of SQL databases necessitates processing large quantities of meta-information about the databases, which in turn results in lengthy prompts with many tokens and high processing costs. To address this challenge, we introduce Datalake Agent, an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently. Instead of utilizing direct solvers for NL2SQL that call the LLM once with all meta-information in the prompt, the Datalake Agent employs an interactive loop to reduce the utilized meta-information. Within the loop, the LLM is used in a reasoning framework that selectively requests only the necessary information to solve a table question answering task. We evaluate the Datalake Agent on a collection of 23 databases with 100 table question answering tasks. The Datalake Agent reduces the tokens used by the LLM by up to 87\% and thus allows for substantial cost reductions while maintaining competitive performance.
comment: Accepted at the NeurIPS 2025 Workshop on Efficient Reasoning. 10 pages, 11 figures
☆ Active Jammer Localization via Acquisition-Aware Path Planning
We propose an active jammer localization framework that combines Bayesian optimization with acquisition-aware path planning. Unlike passive crowdsourced methods, our approach adaptively guides a mobile agent to collect high-utility Received Signal Strength measurements while accounting for urban obstacles and mobility constraints. For this, we modified the A* algorithm, A-UCB*, by incorporating acquisition values into trajectory costs, leading to high-acquisition planned paths. Simulations on realistic urban scenarios show that the proposed method achieves accurate localization with fewer measurements compared to uninformed baselines, demonstrating consistent performance under different environments.
comment: 5 pages, 3 figures
☆ Causal Discovery for Linear DAGs with Dependent Latent Variables via Higher-order Cumulants
This paper addresses the problem of estimating causal directed acyclic graphs in linear non-Gaussian acyclic models with latent confounders (LvLiNGAM). Existing methods assume mutually independent latent confounders or cannot properly handle models with causal relationships among observed variables. We propose a novel algorithm that identifies causal DAGs in LvLiNGAM, allowing causal structures among latent variables, among observed variables, and between the two. The proposed method leverages higher-order cumulants of observed data to identify the causal structure. Extensive simulations and experiments with real-world data demonstrate the validity and practical utility of the proposed algorithm.
comment: 59 pages, 6 figures, and 3 tables
☆ Leveraging Code Cohesion Analysis to Identify Source Code Supply Chain Attacks
Supply chain attacks significantly threaten software security with malicious code injections within legitimate projects. Such attacks are very rare but may have a devastating impact. Detecting spurious code injections using automated tools is further complicated as it often requires deciphering the intention of both the inserted code and its context. In this study, we propose an unsupervised approach for highlighting spurious code injections by quantifying cohesion disruptions in the source code. Using a name-prediction-based cohesion (NPC) metric, we analyze how function cohesion changes when malicious code is introduced compared to natural cohesion fluctuations. An analysis of 54,707 functions over 369 open-source C++ repositories reveals that code injection reduces cohesion and shifts naming patterns toward shorter, less descriptive names compared to genuine function updates. Considering the sporadic nature of real supply-chain attacks, we evaluate the proposed method with extreme test-set imbalance and show that monitoring high-cohesion functions with NPC can effectively detect functions with injected code, achieving a Precision@100 of 36.41% at a 1:1,000 ratio and 12.47% at 1:10,000. These results suggest that automated cohesion measurements, in general, and name-prediction-based cohesion, in particular, may help identify supply chain attacks, improving source code integrity.
☆ Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
comment: Preprint. Under Review
☆ The Pursuit of Diversity: Multi-Objective Testing of Deep Reinforcement Learning Agents
Testing deep reinforcement learning (DRL) agents in safety-critical domains requires discovering diverse failure scenarios. Existing tools such as INDAGO rely on single-objective optimization focused solely on maximizing failure counts, but this does not ensure discovered scenarios are diverse or reveal distinct error types. We introduce INDAGO-Nexus, a multi-objective search approach that jointly optimizes for failure likelihood and test scenario diversity using multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies. We evaluated INDAGO-Nexus on three DRL agents: humanoid walker, self-driving car, and parking agent. On average, INDAGO-Nexus discovers up to 83% and 40% more unique failures (test effectiveness) than INDAGO in the SDC and Parking scenarios, respectively, while reducing time-to-failure by up to 67% across all agents.
comment: Pre-print - Accepted at Symposium on Search Based Software Engineering (SSBSE) 2025 co-located with ASE'25
☆ Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines--a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.
☆ Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
Increasing the batch size during training -- a ''batch ramp'' -- is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36\%$, approaching the theoretical limit implied by our analysis.
Fast and Scalable Score-Based Kernel Calibration Tests
We introduce the Kernel Calibration Conditional Stein Discrepancy test (KCCSD test), a non-parametric, kernel-based test for assessing the calibration of probabilistic models with well-defined scores. In contrast to previous methods, our test avoids the need for possibly expensive expectation approximations while providing control over its type-I error. We achieve these improvements by using a new family of kernels for score-based probabilities that can be estimated without probability density samples, and by using a conditional goodness-of-fit criterion for the KCCSD test's U-statistic. We demonstrate the properties of our test on various synthetic settings.
comment: 26 pages
☆ MCbiF: Measuring Topological Autocorrelation in Multiscale Clusterings via 2-Parameter Persistent Homology
Datasets often possess an intrinsic multiscale structure with meaningful descriptions at different levels of coarseness. Such datasets are naturally described as multi-resolution clusterings, i.e., not necessarily hierarchical sequences of partitions across scales. To analyse and compare such sequences, we use tools from topological data analysis and define the Multiscale Clustering Bifiltration (MCbiF), a 2-parameter filtration of abstract simplicial complexes that encodes cluster intersection patterns across scales. The MCbiF can be interpreted as a higher-order extension of Sankey diagrams and reduces to a dendrogram for hierarchical sequences. We show that the multiparameter persistent homology (MPH) of the MCbiF yields a finitely presented and block decomposable module, and its stable Hilbert functions characterise the topological autocorrelation of the sequence of partitions. In particular, at dimension zero, the MPH captures violations of the refinement order of partitions, whereas at dimension one, the MPH captures higher-order inconsistencies between clusters across scales. We demonstrate through experiments the use of MCbiF Hilbert functions as topological feature maps for downstream machine learning tasks. MCbiF feature maps outperform information-based baseline features on both regression and classification tasks on synthetic sets of non-hierarchical sequences of partitions. We also show an application of MCbiF to real-world data to measure non-hierarchies in wild mice social grouping patterns across time.
☆ FedPPA: Progressive Parameter Alignment for Personalized Federated Learning
Federated Learning (FL) is designed as a decentralized, privacy-preserving machine learning paradigm that enables multiple clients to collaboratively train a model without sharing their data. In real-world scenarios, however, clients often have heterogeneous computational resources and hold non-independent and identically distributed data (non-IID), which poses significant challenges during training. Personalized Federated Learning (PFL) has emerged to address these issues by customizing models for each client based on their unique data distribution. Despite its potential, existing PFL approaches typically overlook the coexistence of model and data heterogeneity arising from clients with diverse computational capabilities. To overcome this limitation, we propose a novel method, called Progressive Parameter Alignment (FedPPA), which progressively aligns the weights of common layers across clients with the global model's weights. Our approach not only mitigates inconsistencies between global and local models during client updates, but also preserves client's local knowledge, thereby enhancing personalization robustness in non-IID settings. To further enhance the global model performance while retaining strong personalization, we also integrate entropy-based weighted averaging into the FedPPA framework. Experiments on three image classification datasets, including MNIST, FMNIST, and CIFAR-10, demonstrate that FedPPA consistently outperforms existing FL algorithms, achieving superior performance in personalized adaptation.
comment: 8 pages, TrustCom 2025 Conference
☆ Response to Discussions of "Causal and Counterfactual Views of Missing Data Models"
We are grateful to the discussants, Levis and Kennedy [2025], Luo and Geng [2025], Wang and van der Laan [2025], and Yang and Kim [2025], for their thoughtful comments on our paper (Nabi et al., 2025). In this rejoinder, we summarize our main contributions and respond to each discussion in turn.
☆ Online Reliable Anomaly Detection via Neuromorphic Sensing and Communications
This paper proposes a low-power online anomaly detection framework based on neuromorphic wireless sensor networks, encompassing possible use cases such as brain-machine interfaces and remote environmental monitoring. In the considered system, a central reader node actively queries a subset of neuromorphic sensor nodes (neuro-SNs) at each time frame. The neuromorphic sensors are event-driven, producing spikes in correspondence to relevant changes in the monitored system. The queried neuro-SNs respond to the reader with impulse radio (IR) transmissions that directly encode the sensed local events. The reader processes these event-driven signals to determine whether the monitored environment is in a normal or anomalous state, while rigorously controlling the false discovery rate (FDR) of detections below a predefined threshold. The proposed approach employs an online hypothesis testing method with e-values to maintain FDR control without requiring knowledge of the anomaly rate, and it dynamically optimizes the sensor querying strategy by casting it as a best-arm identification problem in a multi-armed bandit framework. Extensive performance evaluation demonstrates that the proposed method can reliably detect anomalies under stringent FDR requirements, while efficiently scheduling sensor communications and achieving low detection latency.
☆ When Planners Meet Reality: How Learned, Reactive Traffic Agents Shift nuPlan Benchmarks
Planner evaluation in closed-loop simulation often uses rule-based traffic agents, whose simplistic and passive behavior can hide planner deficiencies and bias rankings. Widely used IDM agents simply follow a lead vehicle and cannot react to vehicles in adjacent lanes, hindering tests of complex interaction capabilities. We address this issue by integrating the state-of-the-art learned traffic agent model SMART into nuPlan. Thus, we are the first to evaluate planners under more realistic conditions and quantify how conclusions shift when narrowing the sim-to-real gap. Our analysis covers 14 recent planners and established baselines and shows that IDM-based simulation overestimates planning performance: nearly all scores deteriorate. In contrast, many planners interact better than previously assumed and even improve in multi-lane, interaction-heavy scenarios like lane changes or turns. Methods trained in closed-loop demonstrate the best and most stable driving performance. However, when reaching their limits in augmented edge-case scenarios, all learned planners degrade abruptly, whereas rule-based planners maintain reasonable basic behavior. Based on our results, we suggest SMART-reactive simulation as a new standard closed-loop benchmark in nuPlan and release the SMART agents as a drop-in alternative to IDM at https://github.com/shgd95/InteractiveClosedLoop.
☆ Geometric Moment Alignment for Domain Adaptation via Siegel Embeddings
We address the problem of distribution shift in unsupervised domain adaptation with a moment-matching approach. Existing methods typically align low-order statistical moments of the source and target distributions in an embedding space using ad-hoc similarity measures. We propose a principled alternative that instead leverages the intrinsic geometry of these distributions by adopting a Riemannian distance for this alignment. Our key novelty lies in expressing the first- and second-order moments as a single symmetric positive definite (SPD) matrix through Siegel embeddings. This enables simultaneous adaptation of both moments using the natural geometric distance on the shared manifold of SPD matrices, preserving the mean and covariance structure of the source and target distributions and yielding a more faithful metric for cross-domain comparison. We connect the Riemannian manifold distance to the target-domain error bound, and validate the method on image denoising and image classification benchmarks. Our code is publicly available at https://github.com/shayangharib/GeoAdapt.
☆ Decorrelation Speeds Up Vision Transformers ICLR 2026
Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
comment: 15 pages, 12 figures, submitted to ICLR 2026
Parameter Identification for Partial Differential Equation with Jump Discontinuities in Coefficients by Markov Switching Model and Physics-Informed Machine Learning
Inverse problems involving partial differential equations (PDEs) with discontinuous coefficients are fundamental challenges in modeling complex spatiotemporal systems with heterogeneous structures and uncertain dynamics. Traditional numerical and machine learning approaches often face limitations in addressing these problems due to high dimensionality, inherent nonlinearity, and discontinuous parameter spaces. In this work, we propose a novel computational framework that synergistically integrates physics-informed deep learning with Bayesian inference for accurate parameter identification in PDEs with jump discontinuities in coefficients. The core innovation of our framework lies in a dual-network architecture employing a gradient-adaptive weighting strategy: a main network approximates PDE solutions while a sub network samples its coefficients. To effectively identify mixture structures in parameter spaces, we employ Markovian dynamics methods to capture hidden state transitions of complex spatiotemporal systems. The framework has applications in reconstruction of solutions and identification of parameter-varying regions. Comprehensive numerical experiments on various PDEs with jump-varying coefficients demonstrate the framework's exceptional adaptability, accuracy, and robustness compared to existing methods. This study provides a generalizable computational approach of parameter identification for PDEs with discontinuous parameter structures, particularly in non-stationary or heterogeneous systems.
☆ Galaxy Morphology Classification with Counterfactual Explanation NeurIPS 2024
Galaxy morphologies play an essential role in the study of the evolution of galaxies. The determination of morphologies is laborious for a large amount of data giving rise to machine learning-based approaches. Unfortunately, most of these approaches offer no insight into how the model works and make the results difficult to understand and explain. We here propose to extend a classical encoder-decoder architecture with invertible flow, allowing us to not only obtain a good predictive performance but also provide additional information about the decision process with counterfactual explanations.
comment: Accepted to the Machine Learning and the Physical Sciences Workshop at NeurIPS 2024 (non-archival)
☆ LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching NeurIPS 2025
The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model's prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.
comment: Accepted as a poster presentation at NeurIPS 2025. Camera-ready version. 10 pages, 7 figures
☆ First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training
As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become a standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block's MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by the observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input for the model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing the training time than the baseline.
☆ Multimodal RAG for Unstructured Data:Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
comment: 12 pages, 6 figures, submitted for review
☆ Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking
Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages applied consecutively to refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance the prediction quality through a dedicated scoring model and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior performance on Astex and PDBbind test sets in terms of docking success rate and physical plausibility. Moreover, our method works approximately 25 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.
Local Causal Discovery for Statistically Efficient Causal Inference
Causal discovery methods can identify valid adjustment sets for causal effect estimation for a pair of target variables, even when the underlying causal graph is unknown. Global causal discovery methods focus on learning the whole causal graph and therefore enable the recovery of optimal adjustment sets, i.e., sets with the lowest asymptotic variance, but they quickly become computationally prohibitive as the number of variables grows. Local causal discovery methods offer a more scalable alternative by focusing on the local neighborhood of the target variables, but are restricted to statistically suboptimal adjustment sets. In this work, we propose Local Optimal Adjustments Discovery (LOAD), a sound and complete causal discovery approach that combines the computational efficiency of local methods with the statistical optimality of global methods. First, LOAD identifies the causal relation between the targets and tests if the causal effect is identifiable by using only local information. If it is identifiable, it then finds the optimal adjustment set by leveraging local causal discovery to infer the mediators and their parents. Otherwise, it returns the locally valid parent adjustment sets based on the learned local structure. In our experiments on synthetic and realistic data LOAD outperforms global methods in scalability, while providing more accurate effect estimation than local methods.
☆ Selective Labeling with False Discovery Rate Control
Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and human labels the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce \textbf{Conformal Labeling}, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal $p$-value for each test instance by comparing AI models' predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose $p$-values are below a data-dependent threshold, certifying AI models' predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.
☆ State-Space Models for Tabular Prior-Data Fitted Networks
Recent advancements in foundation models for tabular data, such as TabPFN, demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSM's inherent sensitivity to the order of input tokens - an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order-dependence, achieving predictive performance competitive to the original TabPFN model.
☆ Redundancy-Aware Test-Time Graph Out-of-Distribution Detection NeurIPS 2025
Distributional discrepancy between training and test data can lead models to make inaccurate predictions when encountering out-of-distribution (OOD) samples in real-world applications. Although existing graph OOD detection methods leverage data-centric techniques to extract effective representations, their performance remains compromised by structural redundancy that induces semantic shifts. To address this dilemma, we propose RedOUT, an unsupervised framework that integrates structural entropy into test-time OOD detection for graph classification. Concretely, we introduce the Redundancy-aware Graph Information Bottleneck (ReGIB) and decompose the objective into essential information and irrelevant redundancy. By minimizing structural entropy, the decoupled redundancy is reduced, and theoretically grounded upper and lower bounds are proposed for optimization. Extensive experiments on real-world datasets demonstrate the superior performance of RedOUT on OOD detection. Specifically, our method achieves an average improvement of 6.7%, significantly surpassing the best competitor by 17.3% on the ClinTox/LIPO dataset pair.
comment: Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving
Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.
comment: To appear at the 58th International Symposium on Microarchitecture (MICRO 2025)
☆ Agentic Entropy-Balanced Policy Optimization
Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to the training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocate global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
comment: Working in progress
☆ A Deep State-Space Model Compression Method using Upper Bound on Output Error
We study deep state-space models (Deep SSMs) that contain linear-quadratic-output (LQO) systems as internal blocks and present a compression method with a provable output error guarantee. We first derive an upper bound on the output error between two Deep SSMs and show that the bound can be expressed via the $h^2$-error norms between the layerwise LQO systems, thereby providing a theoretical justification for existing model order reduction (MOR)-based compression. Building on this bound, we formulate an optimization problem in terms of the $h^2$-error norm and develop a gradient-based MOR method. On the IMDb task from the Long Range Arena benchmark, we demonstrate that our compression method achieves strong performance. Moreover, unlike prior approaches, we reduce roughly 80% of trainable parameters without retraining, with only a 4-5% performance drop.
☆ Symbol Grounding in Neuro-Symbolic AI: A Gentle Introduction to Reasoning Shortcuts
Neuro-symbolic (NeSy) AI aims to develop deep neural networks whose predictions comply with prior knowledge encoding, e.g. safety or structural constraints. As such, it represents one of the most promising avenues for reliable and trustworthy AI. The core idea behind NeSy AI is to combine neural and symbolic steps: neural networks are typically responsible for mapping low-level inputs into high-level symbolic concepts, while symbolic reasoning infers predictions compatible with the extracted concepts and the prior knowledge. Despite their promise, it was recently shown that - whenever the concepts are not supervised directly - NeSy models can be affected by Reasoning Shortcuts (RSs). That is, they can achieve high label accuracy by grounding the concepts incorrectly. RSs can compromise the interpretability of the model's explanations, performance in out-of-distribution scenarios, and therefore reliability. At the same time, RSs are difficult to detect and prevent unless concept supervision is available, which is typically not the case. However, the literature on RSs is scattered, making it difficult for researchers and practitioners to understand and tackle this challenging problem. This overview addresses this issue by providing a gentle introduction to RSs, discussing their causes and consequences in intuitive terms. It also reviews and elucidates existing theoretical characterizations of this phenomenon. Finally, it details methods for dealing with RSs, including mitigation and awareness strategies, and maps their benefits and limitations. By reformulating advanced material in a digestible form, this overview aims to provide a unifying perspective on RSs to lower the bar to entry for tackling them. Ultimately, we hope this overview contributes to the development of reliable NeSy and trustworthy AI models.
☆ Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models
In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
comment: Appendix will be appended soon
☆ On the Identifiability of Tensor Ranks via Prior Predictive Matching
Selecting the latent dimensions (ranks) in tensor factorization is a central challenge that often relies on heuristic methods. This paper introduces a rigorous approach to determine rank identifiability in probabilistic tensor models, based on prior predictive moment matching. We transform a set of moment matching conditions into a log-linear system of equations in terms of marginal moments, prior hyperparameters, and ranks; establishing an equivalence between rank identifiability and the solvability of such system. We apply this framework to four foundational tensor-models, demonstrating that the linear structure of the PARAFAC/CP model, the chain structure of the Tensor Train model, and the closed-loop structure of the Tensor Ring model yield solvable systems, making their ranks identifiable. In contrast, we prove that the symmetric topology of the Tucker model leads to an underdetermined system, rendering the ranks unidentifiable by this method. For the identifiable models, we derive explicit closed-form rank estimators based on the moments of observed data only. We empirically validate these estimators and evaluate the robustness of the proposal.
☆ State Your Intention to Steer Your Attention: An AI Assistant for Intentional Digital Living
When working on digital devices, people often face distractions that can lead to a decline in productivity and efficiency, as well as negative psychological and emotional impacts. To address this challenge, we introduce a novel Artificial Intelligence (AI) assistant that elicits a user's intention, assesses whether ongoing activities are in line with that intention, and provides gentle nudges when deviations occur. The system leverages a large language model to analyze screenshots, application titles, and URLs, issuing notifications when behavior diverges from the stated goal. Its detection accuracy is refined through initial clarification dialogues and continuous user feedback. In a three-week, within-subjects field deployment with 22 participants, we compared our assistant to both a rule-based intent reminder system and a passive baseline that only logged activity. Results indicate that our AI assistant effectively supports users in maintaining focus and aligning their digital behavior with their intentions. Our source code is publicly available at this url https://intentassistant.github.io
☆ Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective
Time Series Forecasting has made significant progress with the help of Patching technique, which partitions time series into multiple patches to effectively retain contextual semantic information into a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which causes a fixed representation space, thus resulting in insufficiently expressful representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes the learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming at fully exploiting the information of contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plugin-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at https://github.com/decisionintelligence/SRSNet.
☆ Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals
This paper proposes a reversible learning framework to improve the robustness and efficiency of value based Reinforcement Learning agents, addressing vulnerability to value overestimation and instability in partially irreversible environments. The framework has two complementary core mechanisms: an empirically derived transition reversibility measure called Phi of s and a, and a selective state rollback operation. We introduce an online per state action estimator called Phi that quantifies the likelihood of returning to a prior state within a fixed horizon K. This measure is used to adjust the penalty term during temporal difference updates dynamically, integrating reversibility awareness directly into the value function. The system also includes a selective rollback operator. When an action yields an expected return markedly lower than its instantaneous estimated value and violates a predefined threshold, the agent is penalized and returns to the preceding state rather than progressing. This interrupts sub optimal high risk trajectories and avoids catastrophic steps. By combining reversibility aware evaluation with targeted rollback, the method improves safety, performance, and stability. In the CliffWalking v0 domain, the framework reduced catastrophic falls by over 99.8 percent and yielded a 55 percent increase in mean episode return. In the Taxi v3 domain, it suppressed illegal actions by greater than or equal to 99.9 percent and achieved a 65.7 percent improvement in cumulative reward, while also sharply reducing reward variance in both environments. Ablation studies confirm that the rollback mechanism is the critical component underlying these safety and performance gains, marking a robust step toward safe and reliable sequential decision making.
comment: Submitted PLOS ONE
☆ From Guess2Graph: When and How Can Unreliable Experts Safely Boost Causal Discovery in Finite Samples?
Causal discovery algorithms often perform poorly with limited samples. While integrating expert knowledge (including from LLMs) as constraints promises to improve performance, guarantees for existing methods require perfect predictions or uncertainty estimates, making them unreliable for practical use. We propose the Guess2Graph (G2G) framework, which uses expert guesses to guide the sequence of statistical tests rather than replacing them. This maintains statistical consistency while enabling performance improvements. We develop two instantiations of G2G: PC-Guess, which augments the PC algorithm, and gPC-Guess, a learning-augmented variant designed to better leverage high-quality expert input. Theoretically, both preserve correctness regardless of expert error, with gPC-Guess provably outperforming its non-augmented counterpart in finite samples when experts are "better than random." Empirically, both show monotonic improvement with expert accuracy, with gPC-Guess achieving significantly stronger gains.
☆ Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning
Fine-tuning large pretrained language models is a common approach for aligning them with human preferences, but noisy or off-target examples can dilute supervision. While small, well-chosen datasets often match the performance of much larger ones, systematic and efficient ways to identify high-value training data remain underexplored. Many current methods rely on heuristics or expensive retraining. We present a theoretically grounded, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example by conditioning on a small, curated holdout set in context. ICA requires no reference model and no additional finetuning. Under a local linearization, ICA is equivalent to a first-order update toward the holdout optimum, motivating its use as a proxy for data value. We derive per-example weights from ICA scores, dynamically reweighting gradient updates as model parameters evolve. Across SFT, DPO, and SimPO, and over diverse backbones and datasets, ICA-based reweighting consistently improves model alignment with minimal overhead. We analyze sensitivity to score update frequency and the choice of $k$ holdout examples for in-context demonstrations, and note limitations for rapidly drifting on-policy updates, highlighting directions for future work. Code and prompts will be released.
☆ Coder as Editor: Code-driven Interpretable Molecular Optimization
Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge. While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications-particularly when operating on non-intuitive representations like SMILES. We introduce MECo, a framework that bridges reasoning and execution by translating editing actions into executable code. MECo reformulates molecular optimization for LLMs as a cascaded framework: generating human-interpretable editing intentions from a molecule and property goal, followed by translating those intentions into executable structural edits via code generation. Our approach achieves over 98% accuracy in reproducing held-out realistic edits derived from chemical reactions and target-specific compound pairs. On downstream optimization benchmarks spanning physicochemical properties and target activities, MECo substantially improves consistency by 38-86 percentage points to 90%+ and achieves higher success rates over SMILES-based baselines while preserving structural similarity. By aligning intention with execution, MECo enables consistent, controllable and interpretable molecular design, laying the foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.
☆ Feature Selection and Regularization in Multi-Class Classification: An Empirical Study of One-vs-Rest Logistic Regression with Gradient Descent Optimization and L1 Sparsity Constraints
Multi-class wine classification presents fundamental trade-offs between model accuracy, feature dimensionality, and interpretability - critical factors for production deployment in analytical chemistry. This paper presents a comprehensive empirical study of One-vs-Rest logistic regression on the UCI Wine dataset (178 samples, 3 cultivars, 13 chemical features), comparing from-scratch gradient descent implementation against scikit-learn's optimized solvers and quantifying L1 regularization effects on feature sparsity. Manual gradient descent achieves 92.59 percent mean test accuracy with smooth convergence, validating theoretical foundations, though scikit-learn provides 24x training speedup and 98.15 percent accuracy. Class-specific analysis reveals distinct chemical signatures with heterogeneous patterns where color intensity varies dramatically (0.31 to 16.50) across cultivars. L1 regularization produces 54-69 percent feature reduction with only 4.63 percent accuracy decrease, demonstrating favorable interpretability-performance trade-offs. We propose an optimal 5-feature subset achieving 62 percent complexity reduction with estimated 92-94 percent accuracy, enabling cost-effective deployment with 80 dollars savings per sample and 56 percent time reduction. Statistical validation confirms robust generalization with sub-2ms prediction latency suitable for real-time quality control. Our findings provide actionable guidelines for practitioners balancing comprehensive chemical analysis against targeted feature measurement in resource-constrained environments.
comment: 29 pages, 7 figures, 5 tables. Submitted to Machine Learning track. Comprehensive empirical evaluation of interpretable linear classification for analytical chemistry applications with focus on production deployment constraints, cost-benefit analysis, and class-specific feature importance patterns
☆ Towards geological inference with process-based and deep generative modeling, part 1: training on fluvial deposits
The distribution of resources in the subsurface is deeply linked to the variations of its physical properties. Generative modeling has long been used to predict those physical properties while quantifying the associated uncertainty. But current approaches struggle to properly reproduce geological structures, and fluvial deposits in particular, because of their continuity. This study explores whether a generative adversarial network (GAN) - a type of deep-learning algorithm for generative modeling - can be trained to reproduce fluvial deposits simulated by a process-based model - a more expensive model that mimics geological processes. An ablation study shows that developments from the deep-learning community to generate large 2D images are directly transferable to 3D images of fluvial deposits. Training remains stable, and the generated samples reproduce the non-stationarity and details of the deposits without mode collapse or pure memorization of the training data. Using a process-based model to generate those training data allows us to include valuable properties other than the usual physical properties. We show how the deposition time let us monitor and validate the performance of a GAN by checking that its samples honor the law of superposition. Our work joins a series of previous studies suggesting that GANs are more robust that given credit for, at least for training datasets targeting specific geological structures. Whether this robustness transfers to larger 3D images and multimodal datasets remains to be seen. Exploring how deep generative models can leverage geological principles like the law of superposition shows a lot of promise.
comment: 24 pages, 16 figures
☆ A Free Lunch in LLM Compression: Revisiting Retraining after Pruning
While Neural Network pruning typically requires retraining the model to recover pruning-induced performance degradation, state-of-the-art Large Language Models (LLMs) pruning methods instead solve a layer-wise mask selection and reconstruction problem on a small set of calibration data to avoid full retraining, as it is considered computationally infeasible for LLMs. Reconstructing single matrices in isolation has favorable properties, such as convexity of the objective and significantly reduced memory requirements compared to full retraining. In practice, however, reconstruction is often implemented at coarser granularities, e.g., reconstructing a whole transformer block against its dense activations instead of a single matrix. In this work, we study the key design choices when reconstructing or retraining the remaining weights after pruning. We conduct an extensive computational study on state-of-the-art GPT architectures, and report several surprising findings that challenge common intuitions about retraining after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Most importantly, this Pareto-optimal setup achieves better performance than full retraining, despite requiring only a fraction of the memory. Furthermore, we demonstrate that simple and efficient pruning criteria such as Wanda can outperform much more complex approaches when the reconstruction step is properly executed, highlighting its importance. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery for LLMs.
☆ MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
The Mixture-of-Experts (MoE) technique has proven to be a promising solution to efficiently scale the model size, which has been widely applied in recent LLM advancements. However, the substantial memory overhead of MoE models has made their compression an important research direction. In this work, we provide a theoretical analysis of expert merging, a recently proposed technique for compressing MoE models. Rather than interpreting expert merging from the conventional perspective of parameter aggregation, we approach it from the perspective of merging experts' outputs. Our key insight is that the merging process can be interpreted as inserting additional matrices into the forward computation, which naturally leads to an optimization formulation. Building on this analysis, we introduce MergeMoE, a method that leverages mathematical optimization to construct the compression matrices. We evaluate MergeMoE on multiple MoE models and show that our algorithm consistently outperforms the baselines with the same compression ratios.
☆ Interaction Concordance Index: Performance Evaluation for Interaction Prediction Methods
Consider two sets of entities and their members' mutual affinity values, say drug-target affinities (DTA). Drugs and targets are said to interact in their effects on DTAs if drug's effect on it depends on the target. Presence of interaction implies that assigning a drug to a target and another drug to another target does not provide the same aggregate DTA as the reversed assignment would provide. Accordingly, correctly capturing interactions enables better decision-making, for example, in allocation of limited numbers of drug doses to their best matching targets. Learning to predict DTAs is popularly done from either solely from known DTAs or together with side information on the entities, such as chemical structures of drugs and targets. In this paper, we introduce interaction directions' prediction performance estimator we call interaction concordance index (IC-index), for both fixed predictors and machine learning algorithms aimed for inferring them. IC-index complements the popularly used DTA prediction performance estimators by evaluating the ratio of correctly predicted directions of interaction effects in data. First, we show the invariance of IC-index on predictors unable to capture interactions. Secondly, we show that learning algorithm's permutation equivariance regarding drug and target identities implies its inability to capture interactions when either drug, target or both are unseen during training. In practical applications, this equivariance is remedied via incorporation of appropriate side information on drugs and targets. We make a comprehensive empirical evaluation over several biomedical interaction data sets with various state-of-the-art machine learning algorithms. The experiments demonstrate how different types of affinity strength prediction methods perform in terms of IC-index complementing existing prediction performance estimators.
☆ Personalized federated learning, Row-wise fusion regularization, Multivariate modeling, Sparse estimation
We study personalized federated learning for multivariate responses where client models are heterogeneous yet share variable-level structure. Existing entry-wise penalties ignore cross-response dependence, while matrix-wise fusion over-couples clients. We propose a Sparse Row-wise Fusion (SROF) regularizer that clusters row vectors across clients and induces within-row sparsity, and we develop RowFed, a communication-efficient federated algorithm that embeds SROF into a linearized ADMM framework with privacy-preserving partial participation. Theoretically, we establish an oracle property for SROF-achieving correct variable-level group recovery with asymptotic normality-and prove convergence of RowFed to a stationary solution. Under random client participation, the iterate gap contracts at a rate that improves with participation probability. Empirically, simulations in heterogeneous regimes show that RowFed consistently lowers estimation and prediction error and strengthens variable-level cluster recovery over NonFed, FedAvg, and a personalized matrix-fusion baseline. A real-data study further corroborates these gains while preserving interpretability. Together, our results position row-wise fusion as an effective and transparent paradigm for large-scale personalized federated multivariate learning, bridging the gap between entry-wise and matrix-wise formulations.
☆ Revisit Modality Imbalance at the Decision Layer
Multimodal learning integrates information from different modalities to enhance model performance, yet it often suffers from modality imbalance, where dominant modalities overshadow weaker ones during joint optimization. This paper reveals that such an imbalance not only occurs during representation learning but also manifests significantly at the decision layer. Experiments on audio-visual datasets (CREMAD and Kinetic-Sounds) show that even after extensive pretraining and balanced optimization, models still exhibit systematic bias toward certain modalities, such as audio. Further analysis demonstrates that this bias originates from intrinsic disparities in feature-space and decision-weight distributions rather than from optimization dynamics alone. We argue that aggregating uncalibrated modality outputs at the fusion stage leads to biased decision-layer weighting, hindering weaker modalities from contributing effectively. To address this, we propose that future multimodal systems should focus more on incorporate adaptive weight allocation mechanisms at the decision layer, enabling relative balanced according to the capabilities of each modality.
comment: Some Insights in Balanced Multimodal Learning
☆ Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
Current transformer accelerators primarily focus on optimizing self-attention due to its quadratic complexity. However, this focus is less relevant for vision transformers with short token lengths, where the Feed-Forward Network (FFN) tends to be the dominant computational bottleneck. This paper presents a low power Vision Transformer accelerator, optimized through algorithm-hardware co-design. The model complexity is reduced using hardware-friendly dynamic token pruning without introducing complex mechanisms. Sparsity is further improved by replacing GELU with ReLU activations and employing dynamic FFN2 pruning, achieving a 61.5\% reduction in operations and a 59.3\% reduction in FFN2 weights, with an accuracy loss of less than 2\%. The hardware adopts a row-wise dataflow with output-oriented data access to eliminate data transposition, and supports dynamic operations with minimal area overhead. Implemented in TSMC's 28nm CMOS technology, our design occupies 496.4K gates and includes a 232KB SRAM buffer, achieving a peak throughput of 1024 GOPS at 1GHz, with an energy efficiency of 2.31 TOPS/W and an area efficiency of 858.61 GOPS/mm2.
comment: 10 pages; IEEE Transactions on Circuits and Systems I: Regular Papers
☆ Beat Detection as Object Detection
Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
comment: 11 pages, 4 figures, 5 tables
☆ BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble WACV 2026
Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards inspection remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors - YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.
comment: This paper has been submitted to IEEE/CVF WACV 2026 Applications track and is currently under review
☆ SHaRe-SSM: An Oscillatory Spiking Neural Network for Target Variable Modeling in Long Sequences
In recent years, with the emergence of large models, there has been a significant interest in spiking neural networks (SNNs) primarily due to their energy efficiency, multiplication-free, and sparse event-based deep learning. Similarly, state space models (SSMs) in varying designs have evolved as a powerful alternative to transformers for target modeling in long sequences, thereby overcoming the quadratic dependence on sequence length of a transformer. Inspired by this progress, we here design SHaRe-SSM (Spiking Harmonic Resonate and Fire State Space Model), for target variable modeling (including both classification and regression) for very-long-range sequences. Our second-order spiking SSM, on average, performs better than transformers or first-order SSMs while circumventing multiplication operations, making it ideal for resource-constrained applications. The proposed block consumes $73 \times$ less energy than second-order ANN-based SSMs for an 18k sequence, while retaining performance. To ensure learnability over the long-range sequences, we propose exploiting the stable and efficient implementation of the dynamical system using parallel scans. Moreover, for the first time, we propose a kernel-based spiking regressor using resonate and fire neurons for very long-range sequences. Our network shows superior performance on even a 50k sequence while being significantly energy-efficient. In addition, we conducted a systematic analysis of the impact of heterogeneity, dissipation, and conservation in resonate-and-fire SSMs.
☆ Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $\Delta$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
☆ PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora
Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
☆ Jet Functors and Weil Algebras in Automatic Differentiation: A Geometric Analysis
We present a geometric formulation of automatic differentiation (AD) using jet bundles and Weil algebras. Reverse-mode AD emerges as cotangent-pullback, while Taylor-mode corresponds to evaluation in a Weil algebra. From these principles, we derive concise statements on correctness, stability, and complexity: a functorial identity for reverse-mode, algebraic exactness of higher-order derivatives, and explicit bounds on truncation error. We further show that tensorized Weil algebras permit one-pass computation of all mixed derivatives with cost linear in the algebra dimension, avoiding the combinatorial blow-up of nested JVP/VJP schedules. This framework interprets AD theory through the lens of differential geometry and offers a foundation for developing structure-preserving differentiation methods in deep learning and scientific computing. Code and examples are available at https://git.nilu.no/geometric-ad/jet-weil-ad.
☆ A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities
Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.
☆ Stop-RAG: Value-Based Retrieval Control for Iterative RAG NeurIPS 2025
Iterative retrieval-augmented generation (RAG) enables large language models to answer complex multi-hop questions, but each additional loop increases latency, costs, and the risk of introducing distracting evidence, motivating the need for an efficient stopping strategy. Existing methods either use a predetermined number of iterations or rely on confidence proxies that poorly reflect whether more retrieval will actually help. We cast iterative RAG as a finite-horizon Markov decision process and introduce Stop-RAG, a value-based controller that adaptively decides when to stop retrieving. Trained with full-width forward-view Q($\lambda$) targets from complete trajectories, Stop-RAG learns effective stopping policies while remaining compatible with black-box APIs and existing pipelines. On multi-hop question-answering benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines and prompting-based stopping with LLMs. These results highlight adaptive stopping as a key missing component in current agentic systems, and demonstrate that value-based control can improve the accuracy of RAG systems.
comment: NeurIPS 2025 MTI-LLM Workshop
☆ DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis
Graph Transformers (GTs) have emerged as powerful architectures for graph-structured data, yet remain constrained by rigid designs and lack quantifiable interpretability. Current state-of-the-art GTs commit to fixed GNN types across all layers, missing potential benefits of depth-specific component selection, while their complex architectures become opaque where performance gains cannot be distinguished between meaningful patterns and spurious correlations. We redesign GT attention through asymmetry, decoupling structural encoding from feature representation: queries derive from node features while keys and values come from GNN transformations. Within this framework, we use Differentiable ARchiTecture Search (DARTS) to select optimal GNN operators at each layer, enabling depth-wise heterogeneity inside transformer attention itself (DARTS-GT). To understand discovered architectures, we develop the first quantitative interpretability framework for GTs through causal ablation. Our metrics (Head-deviation, Specialization, and Focus), identify which heads and nodes drive predictions while enabling model comparison. Experiments across eight benchmarks show DARTS-GT achieves state-of-the-art on four datasets while remaining competitive on others, with discovered architectures revealing dataset-specific patterns. Our interpretability analysis reveals that visual attention salience and causal importance do not always correlate, indicating widely used visualization approaches may miss components that actually matter. Crucially, heterogeneous architectures found by DARTS-GT consistently produced more interpretable models than baselines, establishing that Graph Transformers need not choose between performance and interpretability.
☆ A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer's Disease
Early detection of Alzheimer's Disease (AD) is greatly beneficial to AD patients, leading to early treatments that lessen symptoms and alleviating financial burden of health care. As one of the leading signs of AD, language capability changes can be used for early diagnosis of AD. In this paper, I develop a robust classification method using hybrid word embedding and fine-tuned hyperparameters to achieve state-of-the-art accuracy in the early detection of AD. Specifically, we create a hybrid word embedding based on word vectors from Doc2Vec and ELMo to obtain perplexity scores of the sentences. The scores identify whether a sentence is fluent or not and capture semantic context of the sentences. I enrich the word embedding by adding linguistic features to analyze syntax and semantics. Further, we input an embedded feature vector into logistic regression and fine tune hyperparameters throughout the pipeline. By tuning hyperparameters of the machine learning pipeline (e.g., model regularization parameter, learning rate and vector size of Doc2Vec, and vector size of ELMo), I achieve 91% classification accuracy and an Area Under the Curve (AUC) of 97% in distinguishing early AD from healthy subjects. Based on my knowledge, my model with 91% accuracy and 97% AUC outperforms the best existing NLP model for AD diagnosis with an accuracy of 88% [32]. I study the model stability through repeated experiments and find that the model is stable even though the training data is split randomly (standard deviation of accuracy = 0.0403; standard deviation of AUC = 0.0174). This affirms our proposed method is accurate and stable. This model can be used as a large-scale screening method for AD, as well as a complementary examination for doctors to detect AD.
comment: Peer-reviewed and published in Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2020). 7 pages, 5 figures
☆ LLM-ERM: Sample-Efficient Program Learning via LLM-Guided Search
We seek algorithms for program learning that are both sample-efficient and computationally feasible. Classical results show that targets admitting short program descriptions (e.g., with short ``python code'') can be learned with a ``small'' number of examples (scaling with the size of the code) via length-first program enumeration, but the search is exponential in description length. Consequently, Gradient-based training avoids this cost yet can require exponentially many samples on certain short-program families. To address this gap, we introduce LLM-ERM, a propose-and-verify framework that replaces exhaustive enumeration with an LLM-guided search over candidate programs while retaining ERM-style selection on held-out data. Specifically, we draw $k$ candidates with a pretrained reasoning-augmented LLM, compile and check each on the data, and return the best verified hypothesis, with no feedback, adaptivity, or gradients. Theoretically, we show that coordinate-wise online mini-batch SGD requires many samples to learn certain short programs. {\em Empirically, LLM-ERM solves tasks such as parity variants, pattern matching, and primality testing with as few as 200 samples, while SGD-trained transformers overfit even with 100,000 samples}. These results indicate that language-guided program synthesis recovers much of the statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond the reach of gradient-based training.
☆ Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.
☆ Active Measuring in Reinforcement Learning With Delayed Negative Effects
Measuring states in reinforcement learning (RL) can be costly in real-world settings and may negatively influence future outcomes. We introduce the Actively Observable Markov Decision Process (AOMDP), where an agent not only selects control actions but also decides whether to measure the latent state. The measurement action reveals the true latent state but may have a negative delayed effect on the environment. We show that this reduced uncertainty may provably improve sample efficiency and increase the value of the optimal policy despite these costs. We formulate an AOMDP as a periodic partially observable MDP and propose an online RL algorithm based on belief states. To approximate the belief states, we further propose a sequential Monte Carlo method to jointly approximate the posterior of unknown static environment parameters and unobserved latent states. We evaluate the proposed algorithm in a digital health application, where the agent decides when to deliver digital interventions and when to assess users' health status through surveys.
☆ Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers
In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to timely and comprehensively track the latest research findings and methodologies. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex opensource knowledge graph. By analyzing nearly 8,000 open-source paper data from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.
comment: 9 pages, 10 figures
☆ TED++: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening ICDM 2025
As deep neural networks power increasingly critical applications, stealthy backdoor attacks, where poisoned training inputs trigger malicious model behaviour while appearing benign, pose a severe security risk. Many existing defences are vulnerable when attackers exploit subtle distance-based anomalies or when clean examples are scarce. To meet this challenge, we introduce TED++, a submanifold-aware framework that effectively detects subtle backdoors that evade existing defences. TED++ begins by constructing a tubular neighbourhood around each class's hidden-feature manifold, estimating its local ``thickness'' from a handful of clean activations. It then applies Locally Adaptive Ranking (LAR) to detect any activation that drifts outside the admissible tube. By aggregating these LAR-adjusted ranks across all layers, TED++ captures how faithfully an input remains on the evolving class submanifolds. Based on such characteristic ``tube-constrained'' behaviour, TED++ flags inputs whose LAR-based ranking sequences deviate significantly. Extensive experiments are conducted on benchmark datasets and tasks, demonstrating that TED++ achieves state-of-the-art detection performance under both adaptive-attack and limited-data scenarios. Remarkably, even with only five held-out examples per class, TED++ still delivers near-perfect detection, achieving gains of up to 14\% in AUROC over the next-best method. The code is publicly available at https://github.com/namle-w/TEDpp.
comment: Accepted by ICDM 2025
☆ Learning Human-Humanoid Coordination for Collaborative Object Carrying
Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7%. compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
☆ Enhancing Time-Series Anomaly Detection by Integrating Spectral-Residual Bottom-Up Attention with Reservoir Computing
Reservoir computing (RC) establishes the basis for the processing of time-series data by exploiting the high-dimensional spatiotemporal response of a recurrent neural network to an input signal. In particular, RC trains only the output layer weights. This simplicity has drawn attention especially in Edge Artificial Intelligence (AI) applications. Edge AI enables time-series anomaly detection in real time, which is important because detection delays can lead to serious incidents. However, achieving adequate anomaly-detection performance with RC alone may require an unacceptably large reservoir on resource-constrained edge devices. Without enlarging the reservoir, attention mechanisms can improve accuracy, although they may require substantial computation and undermine the learning efficiency of RC. In this study, to improve the anomaly detection performance of RC without sacrificing learning efficiency, we propose a spectral residual RC (SR-RC) that integrates the spectral residual (SR) method - a learning-free, bottom-up attention mechanism - with RC. We demonstrated that SR-RC outperformed conventional RC and logistic-regression models based on values extracted by the SR method across benchmark tasks and real-world time-series datasets. Moreover, because the SR method, similarly to RC, is well suited for hardware implementation, SR-RC suggests a practical direction for deploying RC as Edge AI for time-series anomaly detection.
☆ Stable Prediction of Adverse Events in Medical Time-Series Data
Early event prediction (EEP) systems continuously estimate a patient's imminent risk to support clinical decision-making. For bedside trust, risk trajectories must be accurate and temporally stable, shifting only with new, relevant evidence. However, current benchmarks (a) ignore stability of risk scores and (b) evaluate mainly on tabular inputs, leaving trajectory behavior untested. To address this gap, we introduce CAREBench, an EEP benchmark that evaluates deployability using multi-modal inputs-tabular EHR, ECG waveforms, and clinical text-and assesses temporal stability alongside predictive accuracy. We propose a stability metric that quantifies short-term variability in per-patient risk and penalizes abrupt oscillations based on local-Lipschitz constants. CAREBench spans six prediction tasks such as sepsis onset and compares classical learners, deep sequence models, and zero-shot LLMs. Across tasks, existing methods, especially LLMs, struggle to jointly optimize accuracy and stability, with notably poor recall at high-precision operating points. These results highlight the need for models that produce evidence-aligned, stable trajectories to earn clinician trust in continuous monitoring settings. (Code: https://github.com/SeewonChoi/CAREBench.)
comment: 18 pages, 3 Figures
☆ Beyond a Single Perspective: Towards a Realistic Evaluation of Website Fingerprinting Attacks
Website Fingerprinting (WF) attacks exploit patterns in encrypted traffic to infer the websites visited by users, posing a serious threat to anonymous communication systems. Although recent WF techniques achieve over 90% accuracy in controlled experimental settings, most studies remain confined to single scenarios, overlooking the complexity of real-world environments. This paper presents the first systematic and comprehensive evaluation of existing WF attacks under diverse realistic conditions, including defense mechanisms, traffic drift, multi-tab browsing, early-stage detection, open-world settings, and few-shot scenarios. Experimental results show that many WF techniques with strong performance in isolated settings degrade significantly when facing other conditions. Since real-world environments often combine multiple challenges, current WF attacks are difficult to apply directly in practice. This study highlights the limitations of WF attacks and introduces a multidimensional evaluation framework, offering critical insights for developing more robust and practical WF attacks.
☆ Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.
☆ Nonparametric Data Attribution for Diffusion Models
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or large-scale settings. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines. Code is available at https://github.com/sail-sg/NDA.
☆ CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions
Large language models have achieved remarkable success but remain largely black boxes with poorly understood internal mechanisms. To address this limitation, many researchers have proposed various interpretability methods including mechanistic analysis, probing classifiers, and activation visualization, each providing valuable insights from different perspectives. Building upon this rich landscape of complementary approaches, we introduce CAST (Compositional Analysis via Spectral Tracking), a probe-free framework that contributes a novel perspective by analyzing transformer layer functions through direct transformation matrix estimation and comprehensive spectral analysis. CAST offers complementary insights to existing methods by estimating the realized transformation matrices for each layer using Moore-Penrose pseudoinverse and applying spectral analysis with six interpretable metrics characterizing layer behavior. Our analysis reveals distinct behaviors between encoder-only and decoder-only models, with decoder models exhibiting compression-expansion cycles while encoder models maintain consistent high-rank processing. Kernel analysis further demonstrates functional relationship patterns between layers, with CKA similarity matrices clearly partitioning layers into three phases: feature extraction, compression, and specialization.
☆ Generalist vs Specialist Time Series Foundation Models: Investigating Potential Emergent Behaviors in Assessing Human Health Using PPG Signals
Foundation models are large-scale machine learning models that are pre-trained on massive amounts of data and can be adapted for various downstream tasks. They have been extensively applied to tasks in Natural Language Processing and Computer Vision with models such as GPT, BERT, and CLIP. They are now also increasingly gaining attention in time-series analysis, particularly for physiological sensing. However, most time series foundation models are specialist models - with data in pre-training and testing of the same type, such as Electrocardiogram, Electroencephalogram, and Photoplethysmogram (PPG). Recent works, such as MOMENT, train a generalist time series foundation model with data from multiple domains, such as weather, traffic, and electricity. This paper aims to conduct a comprehensive benchmarking study to compare the performance of generalist and specialist models, with a focus on PPG signals. Through an extensive suite of total 51 tasks covering cardiac state assessment, laboratory value estimation, and cross-modal inference, we comprehensively evaluate both models across seven dimensions, including win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability. These metrics jointly capture not only the models' capability but also their adaptability, robustness, and efficiency under different fine-tuning strategies, providing a holistic understanding of their strengths and limitations for diverse downstream scenarios. In a full-tuning scenario, we demonstrate that the specialist model achieves a 27% higher win score. Finally, we provide further analysis on generalization, fairness, attention visualizations, and the importance of training data choice.
☆ A Physics Prior-Guided Dual-Stream Attention Network for Motion Prediction of Elastic Bragg Breakwaters
Accurate motion response prediction for elastic Bragg breakwaters is critical for their structural safety and operational integrity in marine environments. However, conventional deep learning models often exhibit limited generalization capabilities when presented with unseen sea states. These deficiencies stem from the neglect of natural decay observed in marine systems and inadequate modeling of wave-structure interaction (WSI). To overcome these challenges, this study proposes a novel Physics Prior-Guided Dual-Stream Attention Network (PhysAttnNet). First, the decay bidirectional self-attention (DBSA) module incorporates a learnable temporal decay to assign higher weights to recent states, aiming to emulate the natural decay phenomenon. Meanwhile, the phase differences guided bidirectional cross-attention (PDG-BCA) module explicitly captures the bidirectional interaction and phase relationship between waves and the structure using a cosine-based bias within a bidirectional cross-computation paradigm. These streams are synergistically integrated through a global context fusion (GCF) module. Finally, PhysAttnNet is trained with a hybrid time-frequency loss that jointly minimizes time-domain prediction errors and frequency-domain spectral discrepancies. Comprehensive experiments on wave flume datasets demonstrate that PhysAttnNet significantly outperforms mainstream models. Furthermore,cross-scenario generalization tests validate the model's robustness and adaptability to unseen environments, highlighting its potential as a framework to develop predictive models for complex systems in ocean engineering.
☆ Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation
Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We study this problem through the lens of robust Markov decision processes (RMDPs), which optimize performance against adversarial transition dynamics. Our focus is the online setting, where the agent has only limited interaction with the environment, making sample efficiency and exploration especially critical. Policy optimization, despite its success in standard RL, remains theoretically and empirically underexplored in robust RL. To bridge this gap, we propose \textbf{D}istributionally \textbf{R}obust \textbf{R}egularized \textbf{P}olicy \textbf{O}ptimization algorithm (DR-RPO), a model-free online policy optimization method that learns robust policies with sublinear regret. To enable tractable optimization within the softmax policy class, DR-RPO incorporates reference-policy regularization, yielding RMDP variants that are doubly constrained in both transitions and policies. To scale to large state-action spaces, we adopt the $d$-rectangular linear MDP formulation and combine linear function approximation with an upper confidence bonus for optimistic exploration. We provide theoretical guarantees showing that policy optimization can achieve polynomial suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches. Finally, empirical results across diverse domains corroborate our theory and demonstrate the robustness of DR-RPO.
comment: 53 pages, 8 figures
☆ Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs
Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
comment: 14 pages, 6 figures, 3 tables, and 1 algorithm
☆ Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present \gencluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we will show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
comment: 14 pages, 11 figures
☆ When Flatness Does (Not) Guarantee Adversarial Robustness
Despite their empirical success, neural networks remain vulnerable to small, adversarial perturbations. A longstanding hypothesis suggests that flat minima, regions of low curvature in the loss landscape, offer increased robustness. While intuitive, this connection has remained largely informal and incomplete. By rigorously formalizing the relationship, we show this intuition is only partially correct: flatness implies local but not global adversarial robustness. To arrive at this result, we first derive a closed-form expression for relative flatness in the penultimate layer, and then show we can use this to constrain the variation of the loss in input space. This allows us to formally analyze the adversarial robustness of the entire network. We then show that to maintain robustness beyond a local neighborhood, the loss needs to curve sharply away from the data manifold. We validate our theoretical predictions empirically across architectures and datasets, uncovering the geometric structure that governs adversarial vulnerability, and linking flatness to model confidence: adversarial examples often lie in large, flat regions where the model is confidently wrong. Our results challenge simplified views of flatness and provide a nuanced understanding of its role in robustness.
☆ A novel Information-Driven Strategy for Optimal Regression Assessment
In Machine Learning (ML), a regression algorithm aims to minimize a loss function based on data. An assessment method in this context seeks to quantify the discrepancy between the optimal response for an input-output system and the estimate produced by a learned predictive model (the student). Evaluating the quality of a learned regressor remains challenging without access to the true data-generating mechanism, as no data-driven assessment method can ensure the achievability of global optimality. This work introduces the Information Teacher, a novel data-driven framework for evaluating regression algorithms with formal performance guarantees to assess global optimality. Our novel approach builds on estimating the Shannon mutual information (MI) between the input variables and the residuals and applies to a broad class of additive noise models. Through numerical experiments, we confirm that the Information Teacher is capable of detecting global optimality, which is aligned with the condition of zero estimation error with respect to the -- inaccessible, in practice -- true model, working as a surrogate measure of the ground truth assessment loss and offering a principled alternative to conventional empirical performance metrics.
Spectral Analysis of Molecular Kernels: When Richer Features Do Not Guarantee Better Generalization
Understanding the spectral properties of kernels offers a principled perspective on generalization and representation quality. While deep models achieve state-of-the-art accuracy in molecular property prediction, kernel methods remain widely used for their robustness in low-data regimes and transparent theoretical grounding. Despite extensive studies of kernel spectra in machine learning, systematic spectral analyses of molecular kernels are scarce. In this work, we provide the first comprehensive spectral analysis of kernel ridge regression on the QM9 dataset, molecular fingerprint, pretrained transformer-based, global and local 3D representations across seven molecular properties. Surprisingly, richer spectral features, measured by four different spectral metrics, do not consistently improve accuracy. Pearson correlation tests further reveal that for transformer-based and local 3D representations, spectral richness can even have a negative correlation with performance. We also implement truncated kernels to probe the relationship between spectrum and predictive performance: in many kernels, retaining only the top 2% of eigenvalues recovers nearly all performance, indicating that the leading eigenvalues capture the most informative features. Our results challenge the common heuristic that "richer spectra yield better generalization" and highlight nuanced relationships between representation, kernel features, and predictive performance. Beyond molecular property prediction, these findings inform how kernel and self-supervised learning methods are evaluated in data-limited scientific and real-world tasks.
comment: 14 pages, 5 figures, 3 tables, SI: 8 pages, 7 figures
☆ Incentive-Based Federated Learning
Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.
comment: 24 pages, 5 figures, chapter for edited book (Federated Learning: Foundations and Applications)
☆ Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation
Diffusion models excel at generation, but their latent spaces are not explicitly organized for interpretable control. We introduce ConDA (Contrastive Diffusion Alignment), a framework that applies contrastive learning within diffusion embeddings to align latent geometry with system dynamics. Motivated by recent advances showing that contrastive objectives can recover more disentangled and structured representations, ConDA organizes diffusion latents such that traversal directions reflect underlying dynamical factors. Within this contrastively structured space, ConDA enables nonlinear trajectory traversal that supports faithful interpolation, extrapolation, and controllable generation. Across benchmarks in fluid dynamics, neural calcium imaging, therapeutic neurostimulation, and facial expression, ConDA produces interpretable latent representations with improved controllability compared to linear traversals and conditioning-based baselines. These results suggest that diffusion latents encode dynamics-relevant structure, but exploiting this structure requires latent organization and traversal along the latent manifold.
☆ MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation
We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA's effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.
☆ ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
☆ HyperAIRI: a plug-and-play algorithm for precise hyperspectral image reconstruction in radio interferometry
The next-generation radio-interferometric (RI) telescopes require imaging algorithms capable of forming high-resolution high-dynamic-range images from large data volumes spanning wide frequency bands. Recently, AIRI, a plug-and-play (PnP) approach taking the forward-backward algorithmic structure (FB), has demonstrated state-of-the-art performance in monochromatic RI imaging by alternating a data-fidelity step with a regularisation step via learned denoisers. In this work, we introduce HyperAIRI, its hyperspectral extension, underpinned by learned hyperspectral denoisers enforcing a power-law spectral model. For each spectral channel, the HyperAIRI denoiser takes as input its current image estimate, alongside estimates of its two immediate neighbouring channels and the spectral index map, and provides as output its associated denoised image. To ensure convergence of HyperAIRI, the denoisers are trained with a Jacobian regularisation enforcing non-expansiveness. To accommodate varying dynamic ranges, we assemble a shelf of pre-trained denoisers, each tailored to a specific dynamic range. At each HyperAIRI iteration, the spectral channels of the target image cube are updated in parallel using dynamic-range-matched denoisers from the pre-trained shelf. The denoisers are also endowed with a spatial image faceting functionality, enabling scalability to varied image sizes. Additionally, we formally introduce Hyper-uSARA, a variant of the optimisation-based algorithm HyperSARA, promoting joint sparsity across spectral channels via the l2,1-norm, also adopting FB. We evaluate HyperAIRI's performance on simulated and real observations. We showcase its superior performance compared to its optimisation-based counterpart Hyper-uSARA, CLEAN's hyperspectral variant in WSClean, and the monochromatic imaging algorithms AIRI and uSARA.
comment: 18 pages, 10 figures, submitted to MNRAS
☆ OCR-APT: Reconstructing APT Stories from Audit Logs using Subgraph Anomaly Detection and LLMs
Advanced Persistent Threats (APTs) are stealthy cyberattacks that often evade detection in system-level audit logs. Provenance graphs model these logs as connected entities and events, revealing relationships that are missed by linear log representations. Existing systems apply anomaly detection to these graphs but often suffer from high false positive rates and coarse-grained alerts. Their reliance on node attributes like file paths or IPs leads to spurious correlations, reducing detection robustness and reliability. To fully understand an attack's progression and impact, security analysts need systems that can generate accurate, human-like narratives of the entire attack. To address these challenges, we introduce OCR-APT, a system for APT detection and reconstruction of human-like attack stories. OCR-APT uses Graph Neural Networks (GNNs) for subgraph anomaly detection, learning behavior patterns around nodes rather than fragile attributes such as file paths or IPs. This approach leads to a more robust anomaly detection. It then iterates over detected subgraphs using Large Language Models (LLMs) to reconstruct multi-stage attack stories. Each stage is validated before proceeding, reducing hallucinations and ensuring an interpretable final report. Our evaluations on the DARPA TC3, OpTC, and NODLINK datasets show that OCR-APT outperforms state-of-the-art systems in both detection accuracy and alert interpretability. Moreover, OCR-APT reconstructs human-like reports that comprehensively capture the attack story.
☆ An Advanced Two-Stage Model with High Sensitivity and Generalizability for Prediction of Hip Fracture Risk Using Multiple Datasets
Hip fractures are a major cause of disability, mortality, and healthcare burden in older adults, underscoring the need for early risk assessment. However, commonly used tools such as the DXA T-score and FRAX often lack sensitivity and miss individuals at high risk, particularly those without prior fractures or with osteopenia. To address this limitation, we propose a sequential two-stage model that integrates clinical and imaging information to improve prediction accuracy. Using data from the Osteoporotic Fractures in Men Study (MrOS), the Study of Osteoporotic Fractures (SOF), and the UK Biobank, Stage 1 (Screening) employs clinical, demographic, and functional variables to estimate baseline risk, while Stage 2 (Imaging) incorporates DXA-derived features for refinement. The model was rigorously validated through internal and external testing, showing consistent performance and adaptability across cohorts. Compared to T-score and FRAX, the two-stage framework achieved higher sensitivity and reduced missed cases, offering a cost-effective and personalized approach for early hip fracture risk assessment. Keywords: Hip Fracture, Two-Stage Model, Risk Prediction, Sensitivity, DXA, FRAX
comment: 38 pages, 3 figures, 8 tables. This is a preprint version of the manuscript titled "An Advanced Two-Stage Model with High Sensitivity and Generalizability for Prediction of Hip Fracture Risk Using Multiple Datasets." The paper is currently under journal submission
☆ Finding geodesics with the Deep Ritz method
Geodesic problems involve computing trajectories between prescribed initial and final states to minimize a user-defined measure of distance, cost, or energy. They arise throughout physics and engineering -- for instance, in determining optimal paths through complex environments, modeling light propagation in refractive media, and the study of spacetime trajectories in control theory and general relativity. Despite their ubiquity, the scientific machine learning (SciML) community has given relatively little attention to investigating its methods in the context of these problems. In this work, we argue that given their simple geometry, variational structure, and natural nonlinearity, geodesic problems are particularly well-suited for the Deep Ritz method. We substantiate this claim with three numerical examples drawn from path planning, optics, and solid mechanics. Our goal is not to provide an exhaustive study of geodesic problems, but rather to identify a promising application of the Deep Ritz method and a fruitful direction for future SciML research.
☆ A simple mean field model of feature learning
Feature learning (FL), where neural networks adapt their internal representations during training, remains poorly understood. Using methods from statistical physics, we derive a tractable, self-consistent mean-field (MF) theory for the Bayesian posterior of two-layer non-linear networks trained with stochastic gradient Langevin dynamics (SGLD). At infinite width, this theory reduces to kernel ridge regression, but at finite width it predicts a symmetry breaking phase transition where networks abruptly align with target functions. While the basic MF theory provides theoretical insight into the emergence of FL in the finite-width regime, semi-quantitatively predicting the onset of FL with noise or sample size, it substantially underestimates the improvements in generalisation after the transition. We trace this discrepancy to a key mechanism absent from the plain MF description: \textit{self-reinforcing input feature selection}. Incorporating this mechanism into the MF theory allows us to quantitatively match the learning curves of SGLD-trained networks and provides mechanistic insight into FL.
☆ Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization
Reinforcement Learning (RL) enables agents to learn optimal decision-making strategies through interaction with an environment, yet training from scratch on complex tasks can be highly inefficient. Transfer learning (TL), widely successful in large language models (LLMs), offers a promising direction for enhancing RL efficiency by leveraging pre-trained models. This paper investigates policy transfer, a TL approach that initializes learning in a target RL task using a policy from a related source task, in the context of continuous-time linear quadratic regulators (LQRs) with entropy regularization. We provide the first theoretical proof of policy transfer for continuous-time RL, proving that a policy optimal for one LQR serves as a near-optimal initialization for closely related LQRs, while preserving the original algorithm's convergence rate. Furthermore, we introduce a novel policy learning algorithm for continuous-time LQRs that achieves global linear and local super-linear convergence. Our results demonstrate both theoretical guarantees and algorithmic benefits of transfer learning in continuous-time RL, addressing a gap in existing literature and extending prior work from discrete to continuous time settings. As a byproduct of our analysis, we derive the stability of a class of continuous-time score-based diffusion models via their connection with LQRs.
☆ Beyond PCA: Manifold Dimension Estimation via Local Graph Structure
Local principal component analysis (Local PCA) has proven to be an effective tool for estimating the intrinsic dimension of a manifold. More recently, curvature-adjusted PCA (CA-PCA) has improved upon this approach by explicitly accounting for the curvature of the underlying manifold, rather than assuming local flatness. Building on these insights, we propose a general framework for manifold dimension estimation that captures the manifold's local graph structure by integrating PCA with regression-based techniques. Within this framework, we introduce two representative estimators: quadratic embedding (QE) and total least squares (TLS). Experiments on both synthetic and real-world datasets demonstrate that these methods perform competitively with, and often outperform, state-of-the-art alternatives.
☆ Predicting the Unpredictable: Reproducible BiLSTM Forecasting of Incident Counts in the Global Terrorism Database (GTD)
We study short-horizon forecasting of weekly terrorism incident counts using the Global Terrorism Database (GTD, 1970--2016). We build a reproducible pipeline with fixed time-based splits and evaluate a Bidirectional LSTM (BiLSTM) against strong classical anchors (seasonal-naive, linear/ARIMA) and a deep LSTM-Attention baseline. On the held-out test set, the BiLSTM attains RMSE 6.38, outperforming LSTM-Attention (9.19; +30.6\%) and a linear lag-regression baseline (+35.4\% RMSE gain), with parallel improvements in MAE and MAPE. Ablations varying temporal memory, training-history length, spatial grain, lookback size, and feature groups show that models trained on long historical data generalize best; a moderate lookback (20--30 weeks) provides strong context; and bidirectional encoding is critical for capturing both build-up and aftermath patterns within the window. Feature-group analysis indicates that short-horizon structure (lagged counts and rolling statistics) contributes most, with geographic and casualty features adding incremental lift. We release code, configs, and compact result tables, and provide a data/ethics statement documenting GTD licensing and research-only use. Overall, the study offers a transparent, baseline-beating reference for GTD incident forecasting.
comment: 12 pages, 5 figures, 2 tables. Code & reproducibility: https://github.com/Davidavid45/Deep-Learning-in-Counterterrorism Data/ethics: GTD used under research-only terms; no raw GTD is redistributed
☆ A Simple Method for PMF Estimation on Large Supports
We study nonparametric estimation of a probability mass function (PMF) on a large discrete support, where the PMF is multi-modal and heavy-tailed. The core idea is to treat the empirical PMF as a signal on a line graph and apply a data-dependent low-pass filter. Concretely, we form a symmetric tri-diagonal operator, the path graph Laplacian perturbed with a diagonal matrix built from the empirical PMF, then compute the eigenvectors, corresponding to the smallest feq eigenvalues. Projecting the empirical PMF onto this low dimensional subspace produces a smooth, multi-modal estimate that preserves coarse structure while suppressing noise. A light post-processing step of clipping and re-normalizing yields a valid PMF. Because we compute the eigenpairs of a symmetric tridiagonal matrix, the computation is reliable and runs time and memory proportional to the support times the dimension of the desired low-dimensional supspace. We also provide a practical, data-driven rule for selecting the dimension based on an orthogonal-series risk estimate, so the method "just works" with minimal tuning. On synthetic and real heavy-tailed examples, the approach preserves coarse structure while suppressing sampling noise, compares favorably to logspline and Gaussian-KDE baselines in the intended regimes. However, it has known failure modes (e.g., abrupt discontinuities). The method is short to implement, robust across sample sizes, and suitable for automated pipelines and exploratory analysis at scale because of its reliability and speed.
☆ Towards Error Centric Intelligence I, Beyond Observational Learning
We argue that progress toward AGI is theory limited rather than data or scale limited. Building on the critical rationalism of Popper and Deutsch, we challenge the Platonic Representation Hypothesis. Observationally equivalent worlds can diverge under interventions, so observational adequacy alone cannot guarantee interventional competence. We begin by laying foundations, definitions of knowledge, learning, intelligence, counterfactual competence and AGI, and then analyze the limits of observational learning that motivate an error centric shift. We recast the problem as three questions about how explicit and implicit errors evolve under an agent's actions, which errors are unreachable within a fixed hypothesis space, and how conjecture and criticism expand that space. From these questions we propose Causal Mechanics, a mechanisms first program in which hypothesis space change is a first class operation and probabilistic structure is used when useful rather than presumed. We advance structural principles that make error discovery and correction tractable, including a differential Locality and Autonomy Principle for modular interventions, a gauge invariant form of Independent Causal Mechanisms for separability, and the Compositional Autonomy Principle for analogy preservation, together with actionable diagnostics. The aim is a scaffold for systems that can convert unreachable errors into reachable ones and correct them.
☆ Navigating the consequences of mechanical ventilation in clinical intensive care settings through an evolutionary game-theoretic framework
Identifying the effects of mechanical ventilation strategies and protocols in critical care requires analyzing data from heterogeneous patient-ventilator systems within the context of the clinical decision-making environment. This research develops a framework to help understand the consequences of mechanical ventilation (MV) and adjunct care decisions on patient outcome from observations of critical care patients receiving MV. Developing an understanding of and improving critical care respiratory management requires the analysis of existing secondary-use clinical data to generate hypotheses about advantageous variations and adaptations of current care. This work introduces a perspective of the joint patient-ventilator-care systems (so-called J6) to develop a scalable method for analyzing data and trajectories of these complex systems. To that end, breath behaviors are analyzed using evolutionary game theory (EGT), which generates the necessary quantitative precursors for deeper analysis through probabilistic and stochastic machinery such as reinforcement learning. This result is one step along the pathway toward MV optimization and personalization. The EGT-based process is analytically validated on synthetic data to reveal potential caveats before proceeding to real-world ICU data applications that expose complexities of the data-generating process J6. The discussion includes potential developments toward a state transition model for the simulating effects of MV decision using empirical and game-theoretic elements.
☆ Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis
Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges--abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues. Demographic targeting also emerges. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.
comment: Under-submission
☆ Deep generative priors for 3D brain analysis
Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
☆ Polarization based direction of arrival estimation using a radio interferometric array
Direction of arrival (DOA) estimation is mostly performed using specialized arrays that have carefully designed receiver spacing and layouts to match the operating frequency range. In contrast, radio interferometric arrays are designed to optimally sample the Fourier space data for making high quality images of the sky. Therefore, using existing radio interferometric arrays (with arbitrary geometry and wide frequency variation) for DOA estimation is practically infeasible except by using images made by such interferometers. In this paper, we focus on low cost DOA estimation without imaging, using a subset of a radio interferometric array, using a fraction of the data collected by the full array, and, enabling early determination of DOAs. The proposed method is suitable for transient and low duty cycle source detection. Moreover, the proposed method is an ideal follow-up step to online radio frequency interference (RFI) mitigation, enabling the early estimation of the DOA of the detected RFI.
☆ DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
comment: NVIDIA-Tech Report
☆ Targeted Attacks and Defenses for Distributed Federated Learning in Vehicular Networks
In emerging networked systems, mobile edge devices such as ground vehicles and unmanned aerial system (UAS) swarms collectively aggregate vast amounts of data to make machine learning decisions such as threat detection in remote, dynamic, and infrastructure-constrained environments where power and bandwidth are scarce. Federated learning (FL) addresses these constraints and privacy concerns by enabling nodes to share local model weights for deep neural networks instead of raw data, facilitating more reliable decision-making than individual learning. However, conventional FL relies on a central server to coordinate model updates in each learning round, which imposes significant computational burdens on the central node and may not be feasible due to the connectivity constraints. By eliminating dependence on a central server, distributed federated learning (DFL) offers scalability, resilience to node failures, learning robustness, and more effective defense strategies. Despite these advantages, DFL remains vulnerable to increasingly advanced and stealthy cyberattacks. In this paper, we design sophisticated targeted training data poisoning and backdoor (Trojan) attacks, and characterize the emerging vulnerabilities in a vehicular network. We analyze how DFL provides resilience against such attacks compared to individual learning and present effective defense mechanisms to further strengthen DFL against the emerging cyber threats.
☆ PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models
As Large Language Models (LLMs) gain traction across critical domains, ensuring secure and trustworthy training processes has become a major concern. Backdoor attacks, where malicious actors inject hidden triggers into training data, are particularly insidious and difficult to detect. Existing post-training verification solutions like Proof-of-Learning are impractical for LLMs due to their requirement for full retraining, lack of robustness against stealthy manipulations, and inability to provide early detection during training. Early detection would significantly reduce computational costs. To address these limitations, we introduce Proof-of-Training Steps, a verification protocol that enables an independent auditor (Alice) to confirm that an LLM developer (Bob) has followed the declared training recipe, including data batches, architecture, and hyperparameters. By analyzing the sensitivity of the LLMs' language modeling head (LM-Head) to input perturbations, our method can expose subtle backdoor injections or deviations in training. Even with backdoor triggers in up to 10 percent of the training data, our protocol significantly reduces the attacker's ability to achieve a high attack success rate (ASR). Our method enables early detection of attacks at the injection step, with verification steps being 3x faster than training steps. Our results highlight the protocol's potential to enhance the accountability and security of LLM development, especially against insider threats.
comment: 10 pages, 6 figures, 1 table. Accepted for presentation at FLLM 2025 (Vienna, Nov 2025)
☆ Operator Flow Matching for Timeseries Forecasting
Forecasting high-dimensional, PDE-governed dynamics remains a core challenge for generative modeling. Existing autoregressive and diffusion-based approaches often suffer cumulative errors and discretisation artifacts that limit long, physically consistent forecasts. Flow matching offers a natural alternative, enabling efficient, deterministic sampling. We prove an upper bound on FNO approximation error and propose TempO, a latent flow matching model leveraging sparse conditioning with channel folding to efficiently process 3D spatiotemporal fields using time-conditioned Fourier layers to capture multi-scale modes with high fidelity. TempO outperforms state-of-the-art baselines across three benchmark PDE datasets, and spectral analysis further demonstrates superior recovery of multi-scale dynamics, while efficiency studies highlight its parameter- and memory-light design compared to attention-based or convolutional regressors.
comment: Preprint
☆ OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
Real-world settings where language models (LMs) are deployed -- in domains spanning healthcare, finance, and other forms of knowledge work -- require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
☆ Online Correlation Clustering: Simultaneously Optimizing All $\ell_p$-norms
The $\ell_p$-norm objectives for correlation clustering present a fundamental trade-off between minimizing total disagreements (the $\ell_1$-norm) and ensuring fairness to individual nodes (the $\ell_\infty$-norm). Surprisingly, in the offline setting it is possible to simultaneously approximate all $\ell_p$-norms with a single clustering. Can this powerful guarantee be achieved in an online setting? This paper provides the first affirmative answer. We present a single algorithm for the online-with-a-sample (AOS) model that, given a small constant fraction of the input as a sample, produces one clustering that is simultaneously $O(\log^4 n)$-competitive for all $\ell_p$-norms with high probability, $O(\log n)$-competitive for the $\ell_\infty$-norm with high probability, and $O(1)$-competitive for the $\ell_1$-norm in expectation. This work successfully translates the offline "all-norms" guarantee to the online world. Our setting is motivated by a new hardness result that demonstrates a fundamental separation between these objectives in the standard random-order (RO) online model. Namely, while the $\ell_1$-norm is trivially $O(1)$-approximable in the RO model, we prove that any algorithm in the RO model for the fairness-promoting $\ell_\infty$-norm must have a competitive ratio of at least $\Omega(n^{1/3})$. This highlights the necessity of a different beyond-worst-case model. We complement our algorithm with lower bounds, showing our competitive ratios for the $\ell_1$- and $\ell_\infty$- norms are nearly tight in the AOS model.
comment: 66 pages
☆ Physics-informed data-driven machine health monitoring for two-photon lithography
Two-photon lithography (TPL) is a sophisticated additive manufacturing technology for creating three-dimensional (3D) micro- and nano-structures. Maintaining the health of TPL systems is critical for ensuring consistent fabrication quality. Current maintenance practices often rely on experience rather than informed monitoring of machine health, resulting in either untimely maintenance that causes machine downtime and poor-quality fabrication, or unnecessary maintenance that leads to inefficiencies and avoidable downtime. To address this gap, this paper presents three methods for accurate and timely monitoring of TPL machine health. Through integrating physics-informed data-driven predictive models for structure dimensions with statistical approaches, the proposed methods are able to handle increasingly complex scenarios featuring different levels of generalizability. A comprehensive experimental dataset that encompasses six process parameter combinations and six structure dimensions under two machine health conditions was collected to evaluate the effectiveness of the proposed approaches. Across all test scenarios, the approaches are shown to achieve high accuracies, demonstrating excellent effectiveness, robustness, and generalizability. These results represent a significant step toward condition-based maintenance for TPL systems.
☆ Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
Widespread LLM adoption has introduced characteristic repetitive phraseology, termed ``slop,'' which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000$\times$ more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90\% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
comment: 11 pages + appendices, 16 figures
☆ The Minimax Lower Bound of Kernel Stein Discrepancy Estimation
Kernel Stein discrepancies (KSDs) have emerged as a powerful tool for quantifying goodness-of-fit over the last decade, featuring numerous successful applications. To the best of our knowledge, all existing KSD estimators with known rate achieve $\sqrt n$-convergence. In this work, we present two complementary results (with different proof strategies), establishing that the minimax lower bound of KSD estimation is $n^{-1/2}$ and settling the optimality of these estimators. Our first result focuses on KSD estimation on $\mathbb R^d$ with the Langevin-Stein operator; our explicit constant for the Gaussian kernel indicates that the difficulty of KSD estimation may increase exponentially with the dimensionality $d$. Our second result settles the minimax lower bound for KSD estimation on general domains.
☆ Learn to Change the World: Multi-level Reinforcement Learning with Model-Changing Actions
Reinforcement learning usually assumes a given or sometimes even fixed environment in which an agent seeks an optimal policy to maximize its long-term discounted reward. In contrast, we consider agents that are not limited to passive adaptations: they instead have model-changing actions that actively modify the RL model of world dynamics itself. Reconfiguring the underlying transition processes can potentially increase the agents' rewards. Motivated by this setting, we introduce the multi-layer configurable time-varying Markov decision process (MCTVMDP). In an MCTVMDP, the lower-level MDP has a non-stationary transition function that is configurable through upper-level model-changing actions. The agent's objective consists of two parts: Optimize the configuration policies in the upper-level MDP and optimize the primitive action policies in the lower-level MDP to jointly improve its expected long-term reward.
☆ Internalizing World Models via Self-Play Finetuning for Agentic RL
Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe Pass@k--the probability that at least one of (k) sampled trajectories succeeds--drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
☆ IQNN-CS: Interpretable Quantum Neural Network for Credit Scoring
Credit scoring is a high-stakes task in financial services, where model decisions directly impact individuals' access to credit and are subject to strict regulatory scrutiny. While Quantum Machine Learning (QML) offers new computational capabilities, its black-box nature poses challenges for adoption in domains that demand transparency and trust. In this work, we present IQNN-CS, an interpretable quantum neural network framework designed for multiclass credit risk classification. The architecture combines a variational QNN with a suite of post-hoc explanation techniques tailored for structured data. To address the lack of structured interpretability in QML, we introduce Inter-Class Attribution Alignment (ICAA), a novel metric that quantifies attribution divergence across predicted classes, revealing how the model distinguishes between credit risk categories. Evaluated on two real-world credit datasets, IQNN-CS demonstrates stable training dynamics, competitive predictive performance, and enhanced interpretability. Our results highlight a practical path toward transparent and accountable QML models for financial decision-making.
comment: Accepted for oral presentation at QUEST-IS'25. To appear in Springer proceedings
☆ Comprehensive language-image pre-training for 3D medical image understanding
Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.
☆ Composition-Grounded Instruction Synthesis for Visual Reasoning
Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
☆ AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport
Flow-based Generative Models (FGMs) effectively transform noise into complex data distributions. Incorporating Optimal Transport (OT) to couple noise and data during FGM training has been shown to improve the straightness of flow trajectories, enabling more effective inference. However, existing OT-based methods estimate the OT plan using (mini-)batches of sampled noise and data points, which limits their scalability to large and high-dimensional datasets in FGMs. This paper introduces AlignFlow, a novel approach that leverages Semi-Discrete Optimal Transport (SDOT) to enhance the training of FGMs by establishing an explicit, optimal alignment between noise distribution and data points with guaranteed convergence. SDOT computes a transport map by partitioning the noise space into Laguerre cells, each mapped to a corresponding data point. During FGM training, i.i.d. noise samples are paired with data points via the SDOT map. AlignFlow scales well to large datasets and model architectures with negligible computational overhead. Experimental results show that AlignFlow improves the performance of a wide range of state-of-the-art FGM algorithms and can be integrated as a plug-and-play component. Code is available at: https://github.com/konglk1203/AlignFlow.
comment: Submitted for peer review on Sep 24, 2025. Note: chairs and reviewers can see and bid on our submission since Sep 28, 2025
☆ The Coverage Principle: How Pre-training Enables Post-Training
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of \emph{coverage}, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of \emph{the coverage principle}, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: \emph{coverage generalizes faster than cross entropy}, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
♻ ☆ MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control
MimicKit is an open-source framework for training motion controllers using motion imitation and reinforcement learning. The codebase provides implementations of commonly-used motion-imitation techniques and RL algorithms. This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures. The codebase is designed to be modular and easily configurable, enabling convenient modification and extension to new characters and tasks. The open-source codebase is available at: https://github.com/xbpeng/MimicKit.
♻ ☆ Training LLM Agents to Empower Humans
Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human attain their objectives. Additionally, these methods often require costly explicit human feedback to provide a training signal. We propose a new approach to tuning assistive language models based on maximizing the human's empowerment, their ability to effect desired changes in the environment. Our empowerment-maximizing method, Empower, only requires offline text data, providing a self-supervised method for fine-tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18-person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate and 38% fewer suggestions. Additionally, we introduce a new environment for evaluating multi-turn code assistance using simulated humans. Using this environment, we show that agents trained with Empower increase the success rate of a simulated human programmer on challenging coding questions by an average of 192% over an SFT baseline. With this empowerment objective, we provide a framework for useful aligned AI agents at scale using only offline data without the need for any additional human feedback or verifiable rewards.
♻ ☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
♻ ☆ Thompson Sampling via Fine-Tuning of LLMs
Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT) leverages the prior knowledge embedded in prompt-conditioned large language models, and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality--a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. We demonstrate that online fine-tuning significantly improves sample efficiency, with negligible impact on computational efficiency.
♻ ☆ Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a \emph{unified manner}, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a \emph{safer} region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.
♻ ☆ Tensor Logic: The Language of AI
Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP and Prolog lack scalability and support for learning. This paper proposes tensor logic, a language that solves these problems by unifying neural and symbolic AI at a fundamental level. The sole construct in tensor logic is the tensor equation, based on the observation that logical rules and Einstein summation are essentially the same operation, and all else can be reduced to them. I show how to elegantly implement key forms of neural, symbolic and statistical AI in tensor logic, including transformers, formal reasoning, kernel machines and graphical models. Most importantly, tensor logic makes new directions possible, such as sound reasoning in embedding space. This combines the scalability and learnability of neural networks with the reliability and transparency of symbolic reasoning, and is potentially a basis for the wider adoption of AI.
comment: 17 pages, 0 figures
♻ ☆ SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.
comment: Project page: https://research.nvidia.com/labs/sil/projects/simuli
♻ ☆ Are Large Reasoning Models Interruptible?
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information. Project Page: http://dynamic-lm.github.io/
comment: Project Page: http://dynamic-lm.github.io
♻ ☆ The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization NeurIPS
As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML$.$ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML$.$ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML$.$ENERGY Benchmark. We then highlight results from the early 2025 iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML$.$ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.
comment: NeurIPS D&B 2025 (Spotlight) | Benchmark: https://github.com/ml-energy/benchmark | Leaderboard: https://ml.energy/leaderboard
♻ ☆ Why is Your Language Model a Poor Implicit Reward Model?
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
comment: Code available at https://github.com/princeton-pli/exrm-vs-imrm
♻ ☆ GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data
Although data that can be naturally represented as graphs is widespread in real-world applications across diverse industries, popular graph ML benchmarks for node property prediction only cover a surprisingly narrow set of data domains, and graph neural networks (GNNs) are often evaluated on just a few academic citation networks. This issue is particularly pressing in light of the recent growing interest in designing graph foundation models. These models are supposed to be able to transfer to diverse graph datasets from different domains, and yet the proposed graph foundation models are often evaluated on a very limited set of datasets from narrow applications. To alleviate this issue, we introduce GraphLand: a benchmark of 14 diverse graph datasets for node property prediction from a range of different industrial applications. GraphLand allows evaluating graph ML models on a wide range of graphs with diverse sizes, structural characteristics, and feature sets, all in a unified setting. Further, GraphLand allows investigating such previously underexplored research questions as how realistic temporal distributional shifts under transductive and inductive settings influence graph ML model performance. To mimic realistic industrial settings, we use GraphLand to compare GNNs with gradient-boosted decision trees (GBDT) models that are popular in industrial applications and show that GBDTs provided with additional graph-based input features can sometimes be very strong baselines. Further, we evaluate currently available general-purpose graph foundation models and find that they fail to produce competitive results on our proposed datasets.
♻ ☆ VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design
Large Language Models demonstrate substantial promise for advancing scientific discovery, yet their deployment in disciplines demanding factual precision and specialized domain constraints presents significant challenges. Within molecular design for pharmaceutical development, these models can propose innovative molecular modifications but frequently generate chemically infeasible structures. We introduce VALID-Mol, a comprehensive framework that integrates chemical validation with LLM-driven molecular design, achieving an improvement in valid chemical structure generation from 3% to 83%. Our methodology synthesizes systematic prompt optimization, automated chemical verification, and domain-adapted fine-tuning to ensure dependable generation of synthesizable molecules with enhanced properties. Our contribution extends beyond implementation details to provide a transferable methodology for scientifically-constrained LLM applications with measurable reliability enhancements. Computational analyses indicate our framework generates promising synthesis candidates with up to 17-fold predicted improvements in target binding affinity while preserving synthetic feasibility.
comment: 6 pages, 1 figure, 1 algorithm, 5 tables, to be published in ISPACS 2025, unabridged version exists as arXiv:2506.23339v1
♻ ☆ REX: Causal discovery based on machine learning and explainability techniques
Explainable Artificial Intelligence (XAI) techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive the causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce ReX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that ReX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, ReX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taking together, these results showcase ReX's effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, ReX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures.
comment: 22 pages, 30 figures, Published in Elsevier's Pattern Recognition
♻ ☆ Strategyproof Reinforcement Learning from Human Feedback NeurIPS 2025
We study Reinforcement Learning from Human Feedback (RLHF) in settings where multiple labelers may strategically misreport feedback to steer the learned policy toward their own preferences. We show that existing RLHF algorithms, including recent pluralistic methods, are not strategyproof, and that even a single strategic labeler can cause arbitrarily large misalignment with social welfare. Moreover, we prove that, in the worst case, any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, where $k$ is the number of labelers. This suggests a fundamental trade-off between incentive alignment (ensuring labelers report truthfully) and policy alignment (maximizing social welfare). To address this, we propose the Pessimistic Median of MLEs algorithm, which, under appropriate policy coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of labelers and samples increases. Our results apply to both contextual bandits and Markov decision processes.
comment: To appear at NeurIPS 2025
♻ ☆ Approximation Rates for Shallow ReLU$^k$ Neural Networks on Sobolev Spaces via the Radon Transform
Let $\Omega\subset \mathbb{R}^d$ be a bounded domain. We consider the problem of how efficiently shallow neural networks with the ReLU$^k$ activation function can approximate functions from Sobolev spaces $W^s(L_p(\Omega))$ with error measured in the $L_q(\Omega)$-norm. Utilizing the Radon transform and recent results from discrepancy theory, we provide a simple proof of nearly optimal approximation rates in a variety of cases, including when $q\leq p$, $p\geq 2$, and $s \leq k + (d+1)/2$. The rates we derive are optimal up to logarithmic factors, and significantly generalize existing results. An interesting consequence is that the adaptivity of shallow ReLU$^k$ neural networks enables them to obtain optimal approximation rates for smoothness up to order $s = k + (d+1)/2$, even though they represent piecewise polynomials of fixed degree $k$.
♻ ☆ SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
We introduce SteeringSafety, a systematic framework for evaluating representation steering methods across seven safety perspectives spanning 17 datasets. While prior work highlights general capabilities of representation steering, we systematically explore safety perspectives including bias, harmfulness, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment. Our framework provides modularized building blocks for state-of-the-art steering methods, enabling unified implementation of DIM, ACE, CAA, PCA, and LAT with recent enhancements like conditional steering. Results on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B reveal that strong steering performance depends critically on pairing of method, model, and specific perspective. DIM shows consistent effectiveness, but all methods exhibit substantial entanglement: social behaviors show highest vulnerability (reaching degradation as high as 76%), jailbreaking often compromises normative judgment, and hallucination steering unpredictably shifts political views. Our findings underscore the critical need for holistic safety evaluations.
♻ ☆ TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference
Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
♻ ☆ Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval
Associative memory models, such as Hopfield networks and their modern variants, have garnered renewed interest due to advancements in memory capacity and connections with self-attention in transformers. In this work, we introduce a unified framework-Hopfield-Fenchel-Young networks-which generalizes these models to a broader family of energy functions. Our energies are formulated as the difference between two Fenchel-Young losses: one, parameterized by a generalized entropy, defines the Hopfield scoring mechanism, while the other applies a post-transformation to the Hopfield output. By utilizing Tsallis and norm entropies, we derive end-to-end differentiable update rules that enable sparse transformations, uncovering new connections between loss margins, sparsity, and exact retrieval of single memory patterns. We further extend this framework to structured Hopfield networks using the SparseMAP transformation, allowing the retrieval of pattern associations rather than a single pattern. Our framework unifies and extends traditional and modern Hopfield networks and provides an energy minimization perspective for widely used post-transformations like $\ell_2$-normalization and layer normalization-all through suitable choices of Fenchel-Young losses and by using convex analysis as a building block. Finally, we validate our Hopfield-Fenchel-Young networks on diverse memory recall tasks, including free and sequential recall. Experiments on simulated data, image retrieval, multiple instance learning, and text rationalization demonstrate the effectiveness of our approach.
comment: 51 pages, 14 figures. arXiv admin note: text overlap with arXiv:2402.13725
♻ ☆ Efficient & Correct Predictive Equivalence for Decision Trees
The Rashomon set of decision trees (DTs) finds importance uses. Recent work showed that DTs computing the same classification function, i.e. predictive equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due the existence of predictive equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related with DTs, including that of deciding predictive equivalent DTs. The approach of McTavish et al. consists of applying the well-known method of Quine-McCluskey (QM) for obtaining minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that the QM method may incorrectly decide predictive equivalence, if two key constraints are not respected, and one may be difficult to formally guarantee. Third, the paper shows that any of the problems to which the smallest DNF representation has been applied to can be solved in polynomial time, in the size of the DT. The experiments confirm that, for DTs for which the worst-case of the QM method is triggered, the algorithms proposed in this paper are orders of magnitude faster than the ones proposed by McTavish et al.
♻ ☆ Thinker: Learning to Think Fast and Slow
Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
comment: 23 pages
♻ ☆ Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.
♻ ☆ Byzantine Failures Harm the Generalization of Robust Distributed Learning Algorithms More Than Data Poisoning
Robust distributed learning algorithms aim to maintain reliable performance despite the presence of misbehaving workers. Such misbehaviors are commonly modeled as $\textit{Byzantine failures}$, allowing arbitrarily corrupted communication, or as $\textit{data poisoning}$, a weaker form of corruption restricted to local training data. While prior work shows similar optimization guarantees for both models, an important question remains: $\textit{How do these threat models impact generalization?}$ Empirical evidence suggests a gap, yet it remains unclear whether it is unavoidable or merely an artifact of suboptimal attacks. We show, for the first time, a fundamental gap in generalization guarantees between the two threat models: Byzantine failures yield strictly worse rates than those achievable under data poisoning. Our findings leverage a tight algorithmic stability analysis of robust distributed learning. Specifically, we prove that: $\textit{(i)}$ under data poisoning, the uniform algorithmic stability of an algorithm with optimal optimization guarantees degrades by an additive factor of $\varTheta ( \frac{f}{n-f} )$, with $f$ out of $n$ workers misbehaving; whereas $\textit{(ii)}$ under Byzantine failures, the degradation is in $\Omega \big( \sqrt{ \frac{f}{n-2f}} \big)$.
♻ ☆ Adapting Noise to Data: Generative Flows from 1D Processes
We introduce a general framework for constructing generative models using one-dimensional noising processes. Beyond diffusion processes, we outline examples that demonstrate the flexibility of our approach. Motivated by this, we propose a novel framework in which the 1D processes themselves are learnable, achieved by parameterizing the noise distribution through quantile functions that adapt to the data. Our construction integrates seamlessly with standard objectives, including Flow Matching and consistency models. Learning quantile-based noise naturally captures heavy tails and compact supports when present. Numerical experiments highlight both the flexibility and the effectiveness of our method.
♻ ☆ Merge-of-Thought Distillation
Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different "best teachers," and even for the same student, the best teacher can vary across datasets. Therefore, to unify multiple teachers' reasoning abilities into a student to overcome conflicts among various teachers' supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including Deepseek-R1, Qwen3-32B, and OpenAI-O1, demonstrating substantial gains. Besides, MoT consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics while reducing catastrophic forgetting, and shows robustness to distribution-shifted and peer-level teachers. Finally, we have demonstrated MoT possesses consensus CoT by eliminating teacher-specific inductive biases and inter-teacher conflicts while repeatedly reinforcing the learning of consensus reasoning features. These results position MoT as a simple, effective route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
♻ ☆ LLM-guided Chemical Process Optimization with a Multi-Agent Approach
Chemical process optimization maximizes production efficiency and economic performance, but optimization algorithms, including gradient-based solvers, numerical methods, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable. We present a multi-agent LLM framework that autonomously infers operating constraints from minimal process descriptions, then collaboratively guides optimization. Our AutoGen-based framework employs OpenAI's o3 model with specialized agents for constraint generation, parameter validation, simulation, and optimization guidance. Through autonomous constraint generation and iterative multi-agent optimization, the framework eliminates the need for predefined operational bounds. Validated on hydrodealkylation across cost, yield, and yield-to-cost ratio metrics, the framework achieved competitive performance with conventional methods while reducing wall-time 31-fold relative to grid search, converging in under 20 minutes. The reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs and applying domain-informed heuristics. Unlike conventional methods requiring predefined constraints, our approach uniquely combines autonomous constraint generation with interpretable parameter exploration. Model comparison reveals reasoning-capable architectures (o3, o1) are essential for successful optimization, while standard models fail to converge. This approach is particularly valuable for emerging processes and retrofit applications where operational constraints are poorly characterized or unavailable.
comment: 16 pages (main manuscript without references), 4 figures
♻ ☆ Say My Name: a Model's Bias Discovery Framework
In the last few years, due to the broad applicability of deep learning to downstream tasks and end-to-end training capabilities, increasingly more concerns about potential biases to specific, non-representative patterns have been raised. Many works focusing on unsupervised debiasing usually leverage the tendency of deep models to learn ``easier'' samples, for example by clustering the latent space to obtain bias pseudo-labels. However, the interpretation of such pseudo-labels is not trivial, especially for a non-expert end user, as it does not provide semantic information about the bias features. To address this issue, we introduce ``Say My Name'' (SaMyNa), the first tool to identify biases within deep models semantically. Unlike existing methods, our approach focuses on biases learned by the model. Our text-based pipeline enhances explainability and supports debiasing efforts: applicable during either training or post-hoc validation, our method can disentangle task-related information and proposes itself as a tool to analyze biases. Evaluation on traditional benchmarks demonstrates its effectiveness in detecting biases and even disclaiming them, showcasing its broad applicability for model diagnosis.
♻ ☆ ConDiSim: Conditional Diffusion Models for Simulation Based Inference
We present a conditional diffusion model - ConDiSim, for simulation-based inference of complex systems with intractable likelihoods. ConDiSim leverages denoising diffusion probabilistic models to approximate posterior distributions, consisting of a forward process that adds Gaussian noise to parameters, and a reverse process learning to denoise, conditioned on observed data. This approach effectively captures complex dependencies and multi-modalities within posteriors. ConDiSim is evaluated across ten benchmark problems and two real-world test problems, where it demonstrates effective posterior approximation accuracy while maintaining computational efficiency and stability in model training. ConDiSim offers a robust and extensible framework for simulation-based inference, particularly suitable for parameter inference workflows requiring fast inference methods.
♻ ☆ Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.
comment: Project page: http://james-oldfield.github.io/tpc
♻ ☆ Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning NeurIPS 2025
Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoders L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat-tuning modifies model behavior.
comment: 51 pages, 33 figures, Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ The Principle of Uncertain Maximum Entropy
The Principle of Maximum Entropy is a rigorous technique for estimating an unknown distribution given partial information while simultaneously minimizing bias. However, an important requirement for applying the principle is that the available information be provided error-free (Jaynes 1982). We relax this requirement using a memoryless communication channel as a framework to derive a new, more general principle. We show our new principle provides an upper bound on the entropy of the unknown distribution and the amount of information lost due to the use of a given communications channel is unknown unless the unknown distribution's entropy is also known. Using our new principle we provide a new interpretation of the classic principle and experimentally show its performance relative to the classic principle and other generally applicable solutions. Finally, we present a simple algorithm for solving our new principle and an approximation useful when samples are limited.
♻ ☆ Disentangled and Self-Explainable Node Representation Learning
Node representations, or embeddings, are low-dimensional vectors that capture node properties, typically learned through unsupervised structural similarity objectives or supervised tasks. While recent efforts have focused on explaining graph model decisions, the interpretability of unsupervised node embeddings remains underexplored. To bridge this gap, we introduce DiSeNE (Disentangled and Self-Explainable Node Embedding), a framework that generates self-explainable embeddings in an unsupervised manner. Our method employs disentangled representation learning to produce dimension-wise interpretable embeddings, where each dimension is aligned with distinct topological structure of the graph. We formalize novel desiderata for disentangled and interpretable embeddings, which drive our new objective functions, optimizing simultaneously for both interpretability and disentanglement. Additionally, we propose several new metrics to evaluate representation quality and human interpretability. Extensive experiments across multiple benchmark datasets demonstrate the effectiveness of our approach.
comment: TMLR 2025
♻ ☆ Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization
Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better generalization might be. Unfortunately, the correlation between many existing sharpness measures and the generalization is usually not strong, sometimes even weak. To close the gap between the intuition and the reality, we propose a novel sharpness measure, i.e., \textit{R\'enyi sharpness}, which is defined as the negative R\'enyi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that \textit{uniform} (identical) eigenvalues of the loss Hessian is most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the \textit{R\'enyi entropy} to concisely characterize the extent of the spread of the eigenvalues of loss Hessian. Normally, the larger the spread, the smaller the (R\'enyi) entropy. To rigorously establish the relationship between generalization and (R\'enyi) sharpness, we provide several generalization bounds in terms of R\'enyi sharpness, by taking advantage of the reparametrization invariance property of R\'enyi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (in specific, Kendall rank correlation) between the R\'enyi sharpness and generalization. Moreover, we propose to use a variant of R\'enyi Sharpness as regularizer during training, i.e., R\'enyi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worthy noting that the test accuracy gain of our proposed RSAM method could be as high as nearly 2.5\%, compared against the classical SAM method.
♻ ☆ Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work provides an initial step toward building more historically accurate TTI models.
♻ ☆ Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos ICCV 2025
Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
comment: ICCV 2025. Project page: https://chenyi99.github.io/moto/
♻ ☆ Nearly Minimax Optimal Regret for Multinomial Logistic Bandit NeurIPS 2024
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
comment: Accepted at NeurIPS 2024
♻ ☆ Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents
We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper's code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent's effectiveness in creating reliable and capable paper agents through in-depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single-cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper's results and can correctly carry out novel user queries. Paper2Agent automatically created AI co-scientist that identified new splicing variant associated with ADHD risk. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co-scientists.
♻ ☆ Adaptive Budget Allocation for Orthogonal-Subspace Adapter Tuning in LLMs Continual Learning
Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks. Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers. Besides, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization and budget allocation. Such decoupling results in potential misalignment, which hinders those approaches' practical application in CL scenarios. To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in an end-to-end training stage. Specifically, OA-Adapter introduces a dynamic bottleneck dimension adaptation mechanism that simultaneously allocates an efficient parameter budget and optimizes task objectives without misalignment.To effectively preserve previously acquired knowledge while coordinating with the dynamic budget allocation, orthogonal constraints are applied specifically between the parameter subspace of the current task and the dynamically allocated parameter subspaces of historical tasks. Experimental results on continual learning benchmarks demonstrate that OA-Adapter outperforms state-of-the-art methods in both accuracy and parameter efficiency. OA-Adapter achieves higher average accuracy while using 58.5% fewer parameters on the standard CL benchmark, and maintains its advantages on two larger benchmarks comprising 15 tasks.
♻ ☆ Role-Aware Multi-modal federated learning system for detecting phishing webpages
We present a federated, multi-modal phishing website detector that supports URL, HTML, and IMAGE inputs without binding clients to a fixed modality at inference: any client can invoke any modality head trained elsewhere. Methodologically, we propose role-aware bucket aggregation on top of FedProx, inspired by Mixture-of-Experts and FedMM. We drop learnable routing and use hard gating (selecting the IMAGE/HTML/URL expert by sample modality), enabling separate aggregation of modality-specific parameters to isolate cross-embedding conflicts and stabilize convergence. On TR-OP, the Fusion head reaches Acc 97.5% with FPR 2.4% across two data types; on the image subset (ablation) it attains Acc 95.5% with FPR 5.9%. For text, we use GraphCodeBERT for URLs and an early three-way embedding for raw, noisy HTML. On WebPhish (HTML) we obtain Acc 96.5% / FPR 1.8%; on TR-OP (raw HTML) we obtain Acc 95.1% / FPR 4.6%. Results indicate that bucket aggregation with hard-gated experts enables stable federated training under strict privacy, while improving the usability and flexibility of multi-modal phishing detection.
comment: 22 pages, 9 figures
♻ ☆ Offline Reinforcement Learning via Inverse Optimization
Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called ``sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance comparing with the state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.
comment: preprint
♻ ☆ Subspace-Boosted Model Merging
Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, the benefits of merging an increasing amount of specialized experts generally lead to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective; revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on both vision and language benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to quantify task similarity, offering a new interpretable perspective on model merging.
comment: 22 pages (main + supp)
♻ ☆ The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or "awareness" regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.
♻ ☆ ECG-Soup: Harnessing Multi-Layer Synergy for ECG Foundation Models
Transformer-based foundation models for Electrocardiograms (ECGs) have recently achieved impressive performance in many downstream applications.
♻ ☆ On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
comment: 14 pages, 3 figures
♻ ☆ Adaptive Set-Mass Calibration with Conformal Prediction
Reliable probabilities are critical in high-risk applications, yet common calibration criteria (confidence, class-wise) are only necessary for full distributional calibration, and post-hoc methods often lack distribution-free guarantees. We propose a set-based notion of calibration, cumulative mass calibration, and a corresponding empirical error measure: the Cumulative Mass Calibration Error (CMCE). We develop a new calibration procedure that starts with conformal prediction to obtain a set of labels that gives the desired coverage. We then instantiate two simple post-hoc calibrators: a mass normalization and a temperature scaling-based rule, tuned to the conformal constraint. On multi-class image benchmarks, especially with a large number of classes, our methods consistently improve CMCE and standard metrics (ECE, cw-ECE, MCE) over baselines, delivering a practical, scalable framework with theoretical guarantees.
♻ ☆ Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification
Lightweight convolutional and transformer-based networks are increasingly preferred for real-time image classification, especially on resource-constrained devices. This study evaluates the impact of hyperparameter optimization on the accuracy and deployment feasibility of seven modern lightweight architectures: ConvNeXt-T, EfficientNetV2-S, MobileNetV3-L, MobileViT v2 (S/XS), RepVGG-A2, and TinyViT-21M, trained on a class-balanced subset of 90,000 images from ImageNet-1K. Under standardized training settings, this paper investigates the influence of learning rate schedules, augmentation, optimizers, and initialization on model performance. Inference benchmarks are performed using an NVIDIA L40s GPU with batch sizes ranging from 1 to 512, capturing latency and throughput in real-time conditions. This work demonstrates that controlled hyperparameter variation significantly alters convergence dynamics in lightweight CNN and transformer backbones, providing insight into stability regions and deployment feasibility in edge artificial intelligence. Our results reveal that tuning alone leads to a top-1 accuracy improvement of 1.5 to 3.5 percent over baselines, and select models (e.g., RepVGG-A2, MobileNetV3-L) deliver latency under 5 milliseconds and over 9,800 frames per second, making them ideal for edge deployment. This work provides reproducible, subset-based insights into lightweight hyperparameter tuning and its role in balancing speed and accuracy. The code and logs may be seen at: https://vineetkumarrakesh.github.io/lcnn-opt
♻ ☆ Provable Mixed-Noise Learning with Flow-Matching
We study Bayesian inverse problems with mixed noise, modeled as a combination of additive and multiplicative Gaussian components. While traditional inference methods often assume fixed or known noise characteristics, real-world applications, particularly in physics and chemistry, frequently involve noise with unknown and heterogeneous structure. Motivated by recent advances in flow-based generative modeling, we propose a novel inference framework based on conditional flow matching embedded within an Expectation-Maximization (EM) algorithm to jointly estimate posterior samplers and noise parameters. To enable high-dimensional inference and improve scalability, we use simulation-free ODE-based flow matching as the generative model in the E-step of the EM algorithm. We prove that, under suitable assumptions, the EM updates converge to the true noise parameters in the population limit of infinite observations. Our numerical results illustrate the effectiveness of combining EM inference with flow matching for mixed-noise Bayesian inverse problems.
♻ ☆ Online Continual Learning via Spiking Neural Networks with Sleep Enhanced Latent Replay
Edge computing scenarios necessitate the development of hardware-efficient online continual learning algorithms to be adaptive to dynamic environment. However, existing algorithms always suffer from high memory overhead and bias towards recently trained tasks. To tackle these issues, this paper proposes a novel online continual learning approach termed as SESLR, which incorporates a sleep enhanced latent replay scheme with spiking neural networks (SNNs). SESLR leverages SNNs' binary spike characteristics to store replay features in single bits, significantly reducing memory overhead. Furthermore, inspired by biological sleep-wake cycles, SESLR introduces a noise-enhanced sleep phase where the model exclusively trains on replay samples with controlled noise injection, effectively mitigating classification bias towards new classes. Extensive experiments on both conventional (MNIST, CIFAR10) and neuromorphic (NMNIST, CIFAR10-DVS) datasets demonstrate SESLR's effectiveness. On Split CIFAR10, SESLR achieves nearly 30% improvement in average accuracy with only one-third of the memory consumption compared to baseline methods. On Split CIFAR10-DVS, it improves accuracy by approximately 10% while reducing memory overhead by a factor of 32. These results validate SESLR as a promising solution for online continual learning in resource-constrained edge computing scenarios.
comment: There are some errors in the data of the paper. It will take some time to conduct additional experiments for adjustment. We hope to temporarily withdraw the preprint to prevent misleading others
♻ ☆ Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
comment: Preprint
♻ ☆ Exploring the Noise Robustness of Online Conformal Prediction NIPS2025
Conformal prediction is an emerging technique for uncertainty quantification that constructs prediction sets guaranteed to contain the true label with a predefined probability. Recent work develops online conformal prediction methods that adaptively construct prediction sets to accommodate distribution shifts. However, existing algorithms typically assume perfect label accuracy which rarely holds in practice. In this work, we investigate the robustness of online conformal prediction under uniform label noise with a known noise rate, in both constant and dynamic learning rate schedules. We show that label noise causes a persistent gap between the actual mis-coverage rate and the desired rate $\alpha$, leading to either overestimated or underestimated coverage guarantees. To address this issue, we propose Noise Robust Online Conformal Prediction (dubbed NR-OCP) by updating the threshold with a novel robust pinball loss, which provides an unbiased estimate of clean pinball loss without requiring ground-truth labels. Our theoretical analysis shows that NR-OCP eliminates the coverage gap in both constant and dynamic learning rate schedules, achieving a convergence rate of $\mathcal{O}(T^{-1/2})$ for both empirical and expected coverage errors under uniform label noise. Extensive experiments demonstrate the effectiveness of our method by achieving both precise coverage and improved efficiency.
comment: Accepted by NIPS2025
♻ ☆ Partially stochastic deep learning with uncertainty quantification for model predictive heating control
Making the control of building heating systems more energy efficient is crucial for reducing global energy consumption and greenhouse gas emissions. Traditional rule-based control methods use a static, outdoor temperature-dependent heating curve to regulate heat input. This open-loop approach fails to account for both the current state of the system (indoor temperature) and free heat gains, such as solar radiation, often resulting in poor thermal comfort and overheating. Model Predictive Control (MPC) addresses these drawbacks by using predictive modeling to optimize heating based on a building's learned thermal behavior, current system state, and weather forecasts. However, current industrial MPC solutions often employ simplified physics-inspired indoor temperature models, sacrificing accuracy for robustness and interpretability. While purely data-driven models offer superior predictive performance and therefore more accurate control, they face challenges such as a lack of transparency. To bridge this gap, we propose a partially stochastic deep learning (DL) architecture, dubbed LSTM+BNN, for building-specific indoor temperature modeling. Unlike most studies that evaluate model performance through simulations or limited test buildings, our experiments across a comprehensive dataset of 100 real-world buildings, under various weather conditions, demonstrate that LSTM+BNN outperforms an industry-proven reference model, reducing the average prediction error measured as RMSE by more than 40% for the 48-hour prediction horizon of interest. Unlike deterministic DL approaches, LSTM+BNN offers a critical advantage by enabling pre-assessment of model competency for control optimization through uncertainty quantification. Thus, the proposed model shows significant potential to improve thermal comfort and energy efficiency achieved with heating MPC solutions.
♻ ☆ Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses
Large language models (LLMs) are increasingly deployed in real-world applications ranging from chatbots to agentic systems, where they are expected to process untrusted data and follow trusted instructions. Failure to distinguish between the two poses significant security risks, exploited by prompt injection attacks, which inject malicious instructions into the data to control model outputs. Model-level defenses have been proposed to mitigate prompt injection attacks. These defenses fine-tune LLMs to ignore injected instructions in untrusted data. We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses. Checkpoint-GCG enhances the Greedy Coordinate Gradient (GCG) attack by leveraging intermediate model checkpoints produced during fine-tuning to initialize GCG, with each checkpoint acting as a stepping stone for the next one to continuously improve attacks. First, we instantiate Checkpoint-GCG to evaluate the robustness of the state-of-the-art defenses in an auditing setup, assuming both (a) full knowledge of the model input and (b) access to intermediate model checkpoints. We show Checkpoint-GCG to achieve up to $96\%$ attack success rate (ASR) against the strongest defense. Second, we relax the first assumption by searching for a universal suffix that would work on unseen inputs, and obtain up to $89.9\%$ ASR against the strongest defense. Finally, we relax both assumptions by searching for a universal suffix that would transfer to similar black-box models and defenses, achieving an ASR of $63.9\%$ against a newly released defended model from Meta.
♻ ☆ Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM
comment: 26 pages, 5 tables
♻ ☆ WeightLoRA: Keep Only Necessary Adapters
The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation ($\texttt{LoRA}$), which adds trainable adapters to selected layers. Although $\texttt{LoRA}$ may obtain accurate solutions, it requires significant memory to train large models and intuition on which layers to add adapters. In this paper, we propose a novel method, $\texttt{WeightLoRA}$, which overcomes this issue by adaptive selection of the most critical $\texttt{LoRA}$ heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. We conduct experiments for a series of competitive benchmarks and DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches. The experimental results demonstrate the efficacy of $\texttt{WeightLoRA}$ and the superior performance of $\texttt{WeightLoRA+}$ in almost all cases.
comment: 13 pages, 9 tables
♻ ☆ Latent Retrieval Augmented Generation of Cross-Domain Protein Binders
Designing protein binders targeting specific sites, which requires to generate realistic and functional interaction patterns, is a fundamental challenge in drug discovery. Current structure-based generative models are limited in generating nterfaces with sufficient rationality and interpretability. In this paper, we propose Retrieval-Augmented Diffusion for Aligned interface (RADiAnce), a new framework that leverages known interfaces to guide the design of novel binders. By unifying retrieval and generation in a shared contrastive latent space, our model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer. Extensive exeriments show that RADiAnce significantly outperforms baseline models across multiple metrics, including binding affinity and recovery of geometries and interactions. Additional experimental results validate cross-domain generalization, demonstrating that retrieving interfaces from diverse domains, such as peptides, antibodies, and protein fragments, enhances the generation performance of binders for other domains. Our work establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.
♻ ☆ Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models
Concept Bottleneck Models (CBNMs) are deep learning models that provide interpretability by enforcing a bottleneck layer where predictions are based exclusively on human-understandable concepts. However, this constraint also restricts information flow and often results in reduced predictive accuracy. Concept Sidechannel Models (CSMs) address this limitation by introducing a sidechannel that bypasses the bottleneck and carry additional task-relevant information. While this improves accuracy, it simultaneously compromises interpretability, as predictions may rely on uninterpretable representations transmitted through sidechannels. Currently, there exists no principled technique to control this fundamental trade-off. In this paper, we close this gap. First, we present a unified probabilistic concept sidechannel meta-model that subsumes existing CSMs as special cases. Building on this framework, we introduce the Sidechannel Independence Score (SIS), a metric that quantifies a CSM's reliance on its sidechannel by contrasting predictions made with and without sidechannel information. We propose SIS regularization, which explicitly penalizes sidechannel reliance to improve interpretability. Finally, we analyze how the expressivity of the predictor and the reliance of the sidechannel jointly shape interpretability, revealing inherent trade-offs across different CSM architectures. Empirical results show that state-of-the-art CSMs, when trained solely for accuracy, exhibit low representation interpretability, and that SIS regularization substantially improves their interpretability, intervenability, and the quality of learned interpretable task predictors. Our work provides both theoretical and practical tools for developing CSMs that balance accuracy and interpretability in a principled manner.
♻ ☆ SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise to speech. We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses used to intervene during inference by modifying the SLM's activations that improve robustness up to 99% with (i) negligible impact on utility and (ii) without any re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs.
♻ ☆ Lost in the Averages: A New Specific Setup to Evaluate Membership Inference Attacks Against Machine Learning Models
Synthetic data generators and machine learning models can memorize their training data, posing privacy concerns. Membership inference attacks (MIAs) are a standard method of estimating the privacy risk of these systems. The risk of individual records is typically computed by evaluating MIAs in a record-specific privacy game. We analyze the record-specific privacy game commonly used for evaluating attackers under realistic assumptions (the \textit{traditional} game) -- particularly for synthetic tabular data -- and show that it averages a record's privacy risk across datasets. We show this implicitly assumes the dataset a record is part of has no impact on the record's risk, providing a misleading risk estimate when a specific model or synthetic dataset is released. Instead, we propose a novel use of the leave-one-out game, used in existing work exclusively to audit differential privacy guarantees, and call this the \textit{model-seeded} game. We formalize it and show that it provides an accurate estimate of the privacy risk posed by a given adversary for a record in its specific dataset. We instantiate and evaluate the state-of-the-art MIA for synthetic data generators in the traditional and model-seeded privacy games, and show across multiple datasets and models that the two privacy games indeed result in different risk scores, with up to 94\% of high-risk records being overlooked by the traditional game. We further show that records in smaller datasets and models not protected by strong differential privacy guarantees tend to have a larger gap between risk estimates. Taken together, our results show that the model-seeded setup yields a risk estimate specific to a certain model or synthetic dataset released and in line with the standard notion of privacy leakage from prior work, meaningfully different from the dataset-averaged risk provided by the traditional privacy game.
comment: Data Privacy Management workshop at ESORICS 2025
♻ ☆ Adversarial Disentanglement by Backpropagation with Physics-Informed Variational Autoencoder
Inference and prediction under partial knowledge of a physical system is challenging, particularly when multiple confounding sources influence the measured response. Explicitly accounting for these influences in physics-based models is often infeasible due to epistemic uncertainty, cost, or time constraints, resulting in models that fail to accurately describe the behavior of the system. On the other hand, data-driven machine learning models such as variational autoencoders are not guaranteed to identify a parsimonious representation. As a result, they can suffer from poor generalization performance and reconstruction accuracy in the regime of limited and noisy data. We propose a physics-informed variational autoencoder architecture that combines the interpretability of physics-based models with the flexibility of data-driven models. To promote disentanglement of the known physics and confounding influences, the latent space is partitioned into physically meaningful variables that parametrize a physics-based model, and data-driven variables that capture variability in the domain and class of the physical system. The encoder is coupled with a decoder that integrates physics-based and data-driven components, and constrained by an adversarial training objective that prevents the data-driven components from overriding the known physics, ensuring that the physics-grounded latent variables remain interpretable. We demonstrate that the model is able to disentangle features of the input signal and separate the known physics from confounding influences using supervision in the form of class and domain observables. The model is evaluated on a series of synthetic case studies relevant to engineering structures, demonstrating the feasibility of the proposed approach.
♻ ☆ Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption
Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including states, actions, rewards, and dynamics, is still unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed $\textbf{RPEX}$: $\textbf{R}$obust $\textbf{P}$olicy $\textbf{EX}$pansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios. Code is available at $\href{https://github.com/felix-thu/RPEX}{https://github.com/felix-thu/RPEX}$.
comment: 39th Conference on Neural Information Processing Systems
♻ ☆ Mapping Farmed Landscapes from Remote Sensing
Effective management of agricultural landscapes is critical for meeting global biodiversity targets, but efforts are hampered by the absence of detailed, large-scale ecological maps. To address this, we introduce Farmscapes, the first large-scale (covering most of England), high-resolution (25cm) map of rural landscape features, including ecologically vital elements like hedgerows, woodlands, and stone walls. This map was generated using a deep learning segmentation model trained on a novel, dataset of 942 manually annotated tiles derived from aerial imagery. Our model accurately identifies key habitats, achieving high f1-scores for woodland (96\%) and farmed land (95\%), and demonstrates strong capability in segmenting linear features, with an F1-score of 72\% for hedgerows. By releasing the England-wide map on Google Earth Engine, we provide a powerful, open-access tool for ecologists and policymakers. This work enables data-driven planning for habitat restoration, supports the monitoring of initiatives like the EU Biodiversity Strategy, and lays the foundation for advanced analysis of landscape connectivity.
♻ ☆ A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets
Healthcare research and development face significant obstacles due to data scarcity and stringent privacy regulations, such as HIPAA and the GDPR, restricting access to essential real-world medical data. These limitations impede innovation, delay robust AI model creation, and hinder advancements in patient-centered care. Synthetic data generation offers a transformative solution by producing artificial datasets that emulate real data statistics while safeguarding patient privacy. We introduce a novel hybrid framework for high-fidelity healthcare data synthesis integrating five augmentation methods: noise injection, interpolation, Gaussian Mixture Model (GMM) sampling, Conditional Variational Autoencoder (CVAE) sampling, and SMOTE, combined via a reinforcement learning-based dynamic weight selection mechanism. Its key innovations include advanced calibration techniques -- moment matching, full histogram matching, soft and adaptive soft histogram matching, and iterative refinement -- that align marginal distributions and preserve joint feature dependencies. Evaluated on the Breast Cancer Wisconsin (UCI Repository) and Khulna Medical College cardiology datasets, our calibrated hybrid achieves Wasserstein distances as low as 0.001 and Kolmogorov-Smirnov statistics around 0.01, demonstrating near-zero marginal discrepancy. Pairwise trend scores surpass 90%, and Nearest Neighbor Adversarial Accuracy approaches 50%, confirming robust privacy protection. Downstream classifiers trained on synthetic data achieve up to 94% accuracy and F1 scores above 93%, comparable to models trained on real data. This scalable, privacy-preserving approach matches state-of-the-art methods, sets new benchmarks for joint-distribution fidelity in healthcare, and supports sensitive AI applications.
♻ ☆ Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach
Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
♻ ☆ Just One Layer Norm Guarantees Stable Extrapolation
In spite of their prevalence, the behaviour of Neural Networks when extrapolating far from the training distribution remains poorly understood, with existing results limited to specific cases. In this work, we prove general results -- the first of their kind -- by applying Neural Tangent Kernel (NTK) theory to analyse infinitely-wide neural networks trained until convergence and prove that the inclusion of just one Layer Norm (LN) fundamentally alters the induced NTK, transforming it into a bounded-variance kernel. As a result, the output of an infinitely wide network with at least one LN remains bounded, even on inputs far from the training data. In contrast, we show that a broad class of networks without LN can produce pathologically large outputs for certain inputs. We support these theoretical findings with empirical experiments on finite-width networks, demonstrating that while standard NNs often exhibit uncontrolled growth outside the training domain, a single LN layer effectively mitigates this instability. Finally, we explore real-world implications of this extrapolatory stability, including applications to predicting residue sizes in proteins larger than those seen during training and estimating age from facial images of underrepresented ethnicities absent from the training set.
♻ ☆ Measuring the stability and plasticity of recommender systems
The typical offline protocol to evaluate recommendation algorithms is to collect a dataset of user-item interactions and then use a part of this dataset to train a model, and the remaining data to measure how closely the model recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only captures performance of a particular model trained at some point in the past. We know, however, that online systems evolve over time. In general, it is a good idea that models reflect such changes, so models are frequently retrained with recent data. But if this is the case, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to, on the one hand, retain past patterns - stability - and, on the other hand, (quickly) adapt to changes - plasticity. We devise an offline evaluation protocol that provides detail on the long-term behavior of models, and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results of three different types of algorithms on the GoodReads dataset that suggest different stability and plasticity profiles depending on the algorithmic technique, and a possible trade-off between stability and plasticity. Although additional experiments will be necessary to confirm these observations, they already illustrate the usefulness of the proposed framework to gain insights on the long term dynamics of recommendation models.
♻ ☆ Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it can near the accuracy that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
♻ ☆ ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models
We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation's policies/principles as directions to move on a model's information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce infoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model's CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single information-geometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability
comment: 52 pages, 10 figures, author typo corrected, abstract typo corrected
♻ ☆ Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation
Adapting to diverse user needs at test time is a key challenge in controllable multi-objective generation. Existing methods are insufficient: merging-based approaches provide indirect, suboptimal control at the parameter level, often disregarding the impacts of multiple objectives. While decoding-based guidance is more direct, it typically requires aggregating logits from multiple expert models, incurring significant space overhead and relying heavily on individual model capacity. To address these issues, we introduce Merge-And-GuidE (MAGE), a two-stage framework that leverages model merging for guided decoding. We first identify a critical compatibility problem between the guidance and base models. In Stage 1, MAGE resolves this by dynamically constructing a more robust base model, merging a series of backbone models that account for multiple objectives. In Stage 2, we merge explicit and implicit value models into a unified guidance proxy, which then steers the decoding of the base model from Stage 1. Our analysis empirically validates Linear Mode Connectivity (LMC) in value models, explores the relationship between model merging and prediction ensembling, and demonstrates the enhanced controllability afforded by our approach. Extensive experiments show that our method outperforms existing approaches, achieving superior controllability, Pareto-optimal performance, and enhanced adaptability.
comment: Work in progress
♻ ☆ Best-of-Both Worlds for linear contextual bandits with paid observations
We study the problem of linear contextual bandits with paid observations, where at each round the learner selects an action in order to minimize its loss in a given context, and can then decide to pay a fixed cost to observe the loss of any arm. Building on the Follow-the-Regularized-Leader framework with efficient estimators via Matrix Geometric Resampling, we introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem. We show that it achieves the minimax-optimal regret of $\Theta(T^{2/3})$ in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) stochastic regimes. Our approach builds on the framework from \cite{BOBWhardproblems} to design BOBW algorithms for ``hard problem'', using analysis techniques tailored for the setting that we consider.
comment: error in the proofs
♻ ☆ Boosting Graph Foundation Model from Structural Perspective
Graph foundation models have recently attracted significant attention due to its strong generalizability. Although existing methods resort to language models to learn unified semantic representations across domains, they disregard the unique structural characteristics of graphs from different domains. To address the problem, in this paper, we boost graph foundation model from structural perspective and propose BooG. The model constructs virtual super nodes to unify structural characteristics of graph data from different domains. Specifically, the super nodes fuse the information of anchor nodes and class labels, where each anchor node captures the information of a node or a graph instance to be classified. Instead of using the raw graph structure, we connect super nodes to all nodes within their neighborhood by virtual edges. This new structure allows for effective information aggregation while unifying cross-domain structural characteristics. Additionally, we propose a novel pre-training objective based on contrastive learning, which learns more expressive representations for graph data and generalizes effectively to different domains and downstream tasks. Experimental results on various datasets and tasks demonstrate the superior performance of BooG. We provide our code and data here: https://anonymous.4open.science/r/BooG-EE42/.
comment: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-025-50736-y}
♻ ☆ Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as an unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
♻ ☆ MIO: A Foundation Model on Multimodal Tokens
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
comment: EMNLP 2025 (Oral). Codes and models are available in https://github.com/MIO-Team/MIO
♻ ☆ Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)--a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
comment: 16 pages, 4 figures
♻ ☆ Why do explanations fail? A typology and discussion on failures in XAI
As Machine Learning models achieve unprecedented levels of performance, the XAI domain aims at making these models understandable by presenting end-users with intelligible explanations. Yet, some existing XAI approaches fail to meet expectations: several issues have been reported in the literature, generally pointing out either technical limitations or misinterpretations by users. In this paper, we argue that the resulting harms arise from a complex overlap of multiple failures in XAI, which existing ad-hoc studies fail to capture. This work therefore advocates for a holistic perspective, presenting a systematic investigation of limitations of current XAI methods and their impact on the interpretation of explanations. % By distinguishing between system-specific and user-specific failures, we propose a typological framework that helps revealing the nuanced complexities of explanation failures. Leveraging this typology, we discuss some research directions to help practitioners better understand the limitations of XAI systems and enhance the quality of ML explanations.
♻ ☆ Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision ICCV 2025
Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://GroundLMM-ICCV.github.io.
comment: ICCV 2025 Findings
♻ ☆ SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets. We observe that LLMs exhibit uneven confidence across the semantic representation space. We argue that examples with different confidence levels should play distinct roles in instruction tuning: Confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels. We then interpolate them to bridge the confidence gap and apply a Mixup-based regularization to support learning on these additional, interpolated examples. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.
♻ ☆ Flows and Diffusions on the Neural Manifold
Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques towards matching a gradient flow, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.
comment: 43 pages, 11 figures, 14 tables
♻ ☆ GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. For this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show GEPO achieves superior stability - only a 3% performance drop from online to 1800s latency-and reduces the best-to-last gap by 85% versus GSPO (1.8 vs. 12.0) while attaining the highest scores, highlighting its effectiveness in decentralized, resource-heterogeneous environments.
♻ ☆ Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling
A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continued pretraining an LLM using science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixtures-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. Such a paradigm enables knowledge in the general domain, and different scientific disciplines can be decoupled, avoiding the negative influence among knowledge in different domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts with 8 activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves 25% average improvement across 30 scientific tasks with a win rate as 70%, while retaining 99% performance in general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator for reasoning boosting, exhibits excellent reasoning performance in solving complex scientific problems with improvements over 30%.
comment: Technical Report
♻ ☆ Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
♻ ☆ Technical and legal aspects of federated learning in bioinformatics: applications, challenges and opportunities
Federated learning leverages data across institutions to improve clinical discovery while complying with data-sharing restrictions and protecting patient privacy. This paper provides a gentle introduction to this approach in bioinformatics, and is the first to review key applications in proteomics, genome-wide association studies (GWAS), single-cell and multi-omics studies in their legal as well as methodological and infrastructural challenges. As the evolution of biobanks in genetics and systems biology has proved, accessing more extensive and varied data pools leads to a faster and more robust exploration and translation of results. More widespread use of federated learning may have a similar impact in bioinformatics, allowing academic and clinical institutions to access many combinations of genotypic, phenotypic and environmental information that are undercovered or not included in existing biobanks.
comment: 30 pages, 4 figures
♻ ☆ SOHES: Self-supervised Open-world Hierarchical Entity Segmentation ICLR 2024
Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES-ICLR.github.io.
comment: ICLR 2024
♻ ☆ When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating 32 LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs, six fine-tuning style settings, and two real-world instruction-tuning datasets, SafeStyle consistently outperforms baselines in maintaining LLM safety.
♻ ☆ KScope: A Framework for Characterizing the Knowledge Status of Language Models NeurIPS 2025
Characterizing a large language model's (LLM's) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model's internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.
comment: Accepted by NeurIPS 2025
♻ ☆ Agentic Misalignment: How LLMs Could Be Insider Threats
We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction. In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals - including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment. Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real. We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers (Amodei, 2025). We are releasing our methods publicly to enable further research.
comment: 20 pages, 12 figures. Code available at https://github.com/anthropic-experimental/agentic-misalignment
♻ ☆ Real-Time Adaptive Motion Planning via Point Cloud-Guided, Energy-Based Diffusion and Potential Fields
Motivated by the problem of pursuit-evasion, we present a motion planning framework that combines energy-based diffusion models with artificial potential fields for robust real time trajectory generation in complex environments. Our approach processes obstacle information directly from point clouds, enabling efficient planning without requiring complete geometric representations. The framework employs classifier-free guidance training and integrates local potential fields during sampling to enhance obstacle avoidance. In dynamic scenarios, the system generates initial trajectories using the diffusion model and continuously refines them through potential field-based adaptation, demonstrating effective performance in pursuit-evasion scenarios with partial pursuer observability.
comment: Accepted to IEEE RA-L 2025
♻ ☆ Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping NeurIPS 2025
Modern foundation models often undergo iterative ``bootstrapping'' in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model performance improves, raising a crucial question: How should the total budget for generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework for analyzing budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies -- particularly exponential growth policies -- exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.
comment: NeurIPS 2025 (spotlight)
♻ ☆ Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics
Recent advances in generative artificial intelligence applications have raised new data security concerns. This paper focuses on defending diffusion models against membership inference attacks. This type of attack occurs when the attacker can determine if a certain data point was used to train the model. Although diffusion models are intrinsically more resistant to membership inference attacks than other generative models, they are still susceptible. The defense proposed here utilizes critically-damped higher-order Langevin dynamics, which introduces several auxiliary variables and a joint diffusion process along these variables. The idea is that the presence of auxiliary variables mixes external randomness that helps to corrupt sensitive input data earlier on in the diffusion process. This concept is theoretically investigated and validated on a toy dataset and a speech dataset using the Area Under the Receiver Operating Characteristic (AUROC) curves and the FID metric.
comment: 5 pages, 2 figures, 1 table
♻ ☆ Uncertainty Quantification with the Empirical Neural Tangent Kernel
While neural networks have demonstrated impressive performance across various tasks, accurately quantifying uncertainty in their predictions is essential to ensure their trustworthiness and enable widespread adoption in critical systems. Several Bayesian uncertainty quantification (UQ) methods exist that are either cheap or reliable, but not both. We propose a post-hoc, sampling-based UQ method for over-parameterized networks at the end of training. Our approach constructs efficient and meaningful deep ensembles by employing a (stochastic) gradient-descent sampling process on appropriately linearized networks. We demonstrate that our method effectively approximates the posterior of a Gaussian process using the empirical Neural Tangent Kernel. Through a series of numerical experiments, we show that our method not only outperforms competing approaches in computational efficiency-often reducing costs by multiple factors-but also maintains state-of-the-art performance across a variety of UQ metrics for both regression and classification tasks.
comment: 39 pages, 6 figures, 13 tables
♻ ☆ Symmetry-Aware GFlowNets ICML 2025
Generative Flow Networks (GFlowNets) offer a powerful framework for sampling graphs in proportion to their rewards. However, existing approaches suffer from systematic biases due to inaccuracies in state transition probability computations. These biases, rooted in the inherent symmetries of graphs, impact both atom-based and fragment-based generation schemes. To address this challenge, we introduce Symmetry-Aware GFlowNets (SA-GFN), a method that incorporates symmetry corrections into the learning process through reward scaling. By integrating bias correction directly into the reward structure, SA-GFN eliminates the need for explicit state transition computations. Empirical results show that SA-GFN enables unbiased sampling while enhancing diversity and consistently generating high-reward graphs that closely match the target distribution.
comment: 29 pages; Accepted at ICML 2025
♻ ☆ R1-Ranker: Teaching LLM Rankers to Reason
Large language models (LLMs) have recently shown strong reasoning abilities in domains like mathematics, coding, and scientific problem-solving, yet their potential for ranking tasks, where prime examples include retrieval, recommender systems, and LLM routing, remains underexplored. Ranking requires complex reasoning across heterogeneous candidates, but existing LLM-based rankers are often domain-specific, tied to fixed backbones, and lack iterative refinement, limiting their ability to fully exploit LLMs' reasoning potential. To address these challenges, we propose R1-Ranker, a reasoning-incentive framework built on reinforcement learning, with two complementary designs: DRanker, which generates full rankings in one shot, and IRanker, which decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified R1-Rankers on nine datasets spanning recommendation, routing, and passage ranking, showing that IRanker-3B consistently achieves state-of-the-art performance, surpasses larger 7B models on some tasks, and yields a 15.7% average relative improvement. Ablation and generalization experiments further confirm the critical role of reinforcement learning and iterative reasoning, with IRanker-3B improving zero-shot performance by over 9% on out-of-domain tasks and reasoning traces boosting other LLMs by up to 22.87%. These results demonstrate that unifying diverse ranking tasks with a single reasoning-driven foundation model is both effective and essential for advancing LLM reasoning in ranking scenarios.
♻ ☆ Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling
Samples of brain signals collected by EEG sensors have inherent anti-correlations that are well modeled by negative edges in a finite graph. To differentiate epilepsy patients from healthy subjects using collected EEG signals, we build lightweight and interpretable transformer-like neural nets by unrolling a spectral denoising algorithm for signals on a balanced signed graph -- graph with no cycles of odd number of negative edges. A balanced signed graph has well-defined frequencies that map to a corresponding positive graph via similarity transform of the graph Laplacian matrices. We implement an ideal low-pass filter efficiently on the mapped positive graph via Lanczos approximation, where the optimal cutoff frequency is learned from data. Given that two balanced signed graph denoisers learn posterior probabilities of two different signal classes during training, we evaluate their reconstruction errors for binary classification of EEG signals. Experiments show that our method achieves classification performance comparable to representative deep learning schemes, while employing dramatically fewer parameters.
♻ ☆ FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling NeurIPS 2025
Federated Learning (FL) enables collaborative model training across distributed clients without data sharing, but its high computational and communication demands strain resource-constrained devices. While existing methods use dynamic pruning to improve efficiency by periodically adjusting sparse model topologies while maintaining sparsity, these approaches suffer from issues such as greedy adjustments, unstable topologies, and communication inefficiency, resulting in less robust models and suboptimal performance under data heterogeneity and partial client availability. To address these challenges, we propose Federated Robust pruning via combinatorial Thompson Sampling (FedRTS), a novel framework designed to develop robust sparse models. FedRTS enhances robustness and performance through its Thompson Sampling-based Adjustment (TSAdj) mechanism, which uses probabilistic decisions informed by stable, farsighted information instead of deterministic decisions reliant on unstable and myopic information in previous methods. Extensive experiments demonstrate that FedRTS achieves state-of-the-art performance in computer vision and natural language processing tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation. Our codes are available at: https://github.com/Little0o0/FedRTS
comment: Accepted by NeurIPS 2025
♻ ☆ ELASTIC: Efficient Once For All Iterative Search for Object Detection on Microcontrollers
Deploying high-performance object detectors on TinyML platforms poses significant challenges due to tight hardware constraints and the modular complexity of modern detection pipelines. Neural Architecture Search (NAS) offers a path toward automation, but existing methods either restrict optimization to individual modules, sacrificing cross-module synergy, or require global searches that are computationally intractable. We propose ELASTIC (Efficient Once for AlL IterAtive Search for ObjecT DetectIon on MiCrocontrollers), a unified, hardware-aware NAS framework that alternates optimization across modules (e.g., backbone, neck, and head) in a cyclic fashion. ELASTIC introduces a novel Population Passthrough mechanism in evolutionary search that retains high-quality candidates between search stages, yielding faster convergence, up to an 8% final mAP gain, and eliminates search instability observed without population passthrough. In a controlled comparison, empirical results show ELASTIC achieves +4.75% higher mAP and 2x faster convergence than progressive NAS strategies on SVHN, and delivers a +9.09% mAP improvement on PascalVOC given the same search budget. ELASTIC achieves 72.3% mAP on PascalVOC, outperforming MCUNET by 20.9% and TinyissimoYOLO by 16.3%. When deployed on MAX78000/MAX78002 microcontrollers, ELASTICderived models outperform Analog Devices' TinySSD baselines, reducing energy by up to 71.6%, lowering latency by up to 2.4x, and improving mAP by up to 6.99 percentage points across multiple datasets.
comment: 8 pages, 7 figures
♻ ☆ Minimax Optimal Kernel Two-Sample Tests with Random Features
Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy), for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently have minimax optimal two-sample tests been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral-regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs almost similarly (with a small drop in power) to the exact test.
comment: 69 pages, 10 figures, 5 tables
♻ ☆ RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
♻ ☆ An analysis of the derivative-free loss method for solving PDEs
This study analyzes the derivative-free loss method to solve a certain class of elliptic PDEs and fluid problems using neural networks. The approach leverages the Feynman-Kac formulation, incorporating stochastic walkers and their averaged values. We investigate how the time interval associated with the Feynman-Kac representation and the walker size influence computational efficiency, trainability, and sampling errors. Our analysis shows that the training loss bias scales proportionally with the time interval and the spatial gradient of the neural network, while being inversely proportional to the walker size. Moreover, we demonstrate that the time interval must be sufficiently long to enable effective training. These results indicate that the walker size can be chosen as small as possible, provided it satisfies the optimal lower bound determined by the time interval. Finally, we present numerical experiments that support our theoretical findings.
comment: 21 pages, 8 figures
♻ ☆ VERITAS: Verifying the Performance of AI-native Transceiver Actions in Base-Stations
Artificial Intelligence (AI)-native receivers prove significant performance improvement in high noise regimes and can potentially reduce communication overhead compared to the traditional receiver. However, their performance highly depends on the representativeness of the training dataset. A major issue is the uncertainty of whether the training dataset covers all test environments and waveform configurations, and thus, whether the trained model is robust in practical deployment conditions. To this end, we propose a joint measurement-recovery framework for AI-native transceivers post deployment, called VERITAS, that continuously looks for distribution shifts in the received signals and triggers finite re-training spurts. VERITAS monitors the wireless channel using 5G pilots fed to an auxiliary neural network that detects out-of-distribution channel profile, transmitter speed, and delay spread. As soon as such a change is detected, a traditional (reference) receiver is activated, which runs for a period of time in parallel to the AI-native receiver. Finally, VERTIAS compares the bit probabilities of the AI-native and the reference receivers for the same received data inputs, and decides whether or not a retraining process needs to be initiated. Our evaluations reveal that VERITAS can detect changes in the channel profile, transmitter speed, and delay spread with 99%, 97%, and 69% accuracies, respectively, followed by timely initiation of retraining for 86%, 93.3%, and 94.8% of inputs in channel profile, transmitter speed, and delay spread test sets, respectively.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Benchmarking drug-drug interaction prediction methods: a perspective of distribution changes
Motivation: Emerging drug-drug interaction (DDI) prediction is crucial for new drugs but is hindered by distribution changes between known and new drugs in real-world scenarios. Current evaluation often neglects these changes, relying on unrealistic i.i.d. split due to the absence of drug approval data. Results: We propose DDI-Ben, a benchmarking framework for emerging DDI prediction under distribution changes. DDI-Ben introduces a distribution change simulation framework that leverages distribution changes between drug sets as a surrogate for real-world distribution changes of DDIs, and is compatible with various drug split strategies. Through extensive benchmarking on ten representative methods, we show that most existing approaches suffer substantial performance degradation under distribution changes. Our analysis further indicates that large language model (LLM) based methods and the integration of drug-related textual information offer promising robustness against such degradation. To support future research, we release the benchmark datasets with simulated distribution changes. Overall, DDI-Ben highlights the importance of explicitly addressing distribution changes and provides a foundation for developing more resilient methods for emerging DDI prediction. Availability and implementation: Our code and data are available at https://github.com/LARS-research/DDI-Bench.
♻ ☆ Protenix-Mini+: efficient structure prediction model with scalable pairformer
Lightweight inference is critical for biomolecular structure prediction and downstream tasks, enabling efficient real-world deployment and inference-time scaling for large-scale applications. While AF3 and its variants (e.g., Protenix, Chai-1) have advanced structure prediction results, they suffer from critical limitations: high inference latency and cubic time complexity with respect to token count, both of which restrict scalability for large biomolecular complexes. To address the core challenge of balancing model efficiency and prediction accuracy, we introduce three key innovations: (1) compressing non-scalable operations to mitigate cubic time complexity, (2) removing redundant blocks across modules to reduce unnecessary overhead, and (3) adopting a few-step sampler for the atom diffusion module to accelerate inference. Building on these design principles, we develop Protenix-Mini+, a highly lightweight and scalable variant of the Protenix model. Within an acceptable range of performance degradation, it substantially improves computational efficiency. For example, in the case of low-homology single-chain proteins, Protenix-Mini+ experiences an intra-protein LDDT drop of approximately 3% relative to the full Protenix model -- an acceptable performance trade-off given its substantially 90%+ improved computational efficiency.
♻ ☆ Regularizing Extrapolation in Causal Inference
Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel "bias-bias-variance" tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.
♻ ☆ EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images-resulting in limited coverage and inheriting biases from prior generative models-or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
comment: Tianyu Chen and Yasi Zhang contributed equally; Oscar Leong, Lijuan Wang, Ying Nian Wu, and Mingyuan Zhou advised equally
♻ ☆ SoK: Adversarial Evasion Attacks Practicality in NIDS Domain and the Impact of Dynamic Learning
Machine Learning (ML) has become pervasive, and its deployment in Network Intrusion Detection Systems (NIDS) is inevitable due to its automated nature and high accuracy compared to traditional models in processing and classifying large volumes of data. However, ML has been found to have several flaws, most importantly, adversarial attacks, which aim to trick ML models into producing faulty predictions. While most adversarial attack research focuses on computer vision datasets, recent studies have explored the suitability of these attacks against ML-based network security entities, especially NIDS, due to the wide difference between different domains regarding the generation of adversarial attacks. To further explore the practicality of adversarial attacks against ML-based NIDS in-depth, this paper presents several key contributions: identifying numerous practicality issues for evasion adversarial attacks on ML-NIDS using an attack tree threat model, introducing a taxonomy of practicality issues associated with adversarial attacks against ML-based NIDS, identifying specific leaf nodes in our attack tree that demonstrate some practicality for real-world implementation and conducting a comprehensive review and exploration of these potentially viable attack approaches, and investigating how the dynamicity of real-world ML models affects evasion adversarial attacks against NIDS. Our experiments indicate that continuous re-training, even without adversarial training, can reduce the effectiveness of adversarial attacks. While adversarial attacks can compromise ML-based NIDSs, our aim is to highlight the significant gap between research and real-world practicality in this domain, which warrants attention.
♻ ☆ One-Step Flow Policy Mirror Descent
Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on rectified flow policy and MeanFlow policy, respectively. Extensive empirical evaluations on MuJoCo and visual DeepMind Control Suite benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring orders of magnitude less computational cost during inference.
♻ ☆ Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach
One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pre-trained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model's performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs' performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.
♻ ☆ Byzantine Failures Harm the Generalization of Robust Distributed Learning Algorithms More Than Data Poisoning
Robust distributed learning algorithms aim to maintain reliable performance despite the presence of misbehaving workers. Such misbehaviors are commonly modeled as Byzantine failures, allowing arbitrarily corrupted communication, or as data poisoning, a weaker form of corruption restricted to local training data. While prior work shows similar optimization guarantees for both models, an important question remains: How do these threat models impact generalization? Empirical evidence suggests a gap, yet it remains unclear whether it is unavoidable or merely an artifact of suboptimal attacks. We show, for the first time, a fundamental gap in generalization guarantees between the two threat models: Byzantine failures yield strictly worse rates than those achievable under data poisoning. Our findings leverage a tight algorithmic stability analysis of robust distributed learning. Specifically, we prove that: (i) under data poisoning, the uniform algorithmic stability of an algorithm with optimal optimization guarantees degrades by an additive factor of $\varTheta ( \frac{f}{n-f} )$, with $f$ out of $n$ workers misbehaving; whereas $\textit{(ii)}$ under Byzantine failures, the degradation is in $\Omega \big( \sqrt{ \frac{f}{n-2f}} \big)$.
♻ ☆ Low-Rank Adaptation of Neural Fields SIGGRAPH
Processing visual data often involves small adjustments or sequences of changes, e.g., image filtering, surface smoothing, and animation. While established graphics techniques like normal mapping and video compression exploit redundancy to encode such small changes efficiently, the problem of encoding small changes to neural fields -- neural network parameterizations of visual or physical functions -- has received less attention. We propose a parameter-efficient strategy for updating neural fields using low-rank adaptations (LoRA). LoRA, a method from the parameter-efficient fine-tuning LLM community, encodes small updates to pre-trained models with minimal computational overhead. We adapt LoRA for instance-specific neural fields, avoiding the need for large pre-trained models and yielding lightweight updates. We validate our approach with experiments in image filtering, geometry editing, video compression, and energy-based editing, demonstrating its effectiveness and versatility for representing neural field updates.
comment: SIGGRAPH Asia 2025
♻ ☆ SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow NeurIPS
Prompt-based models have demonstrated impressive prompt-following capability at image editing tasks. However, the models still struggle with following detailed editing prompts or performing local edits. Specifically, global image quality often deteriorates immediately after a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality during more than 100 editing steps, while keeping the unedited regions intact. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.
comment: The paper has been accepted to NeurIPS Creative AI Track 2025. Figure 4(c) has been accepted to CVPR AI Art Gallery 2025
♻ ☆ Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment
Automated ultrasonic weld inspection remains a significant challenge in the nondestructive evaluation (NDE) community to factors such as limited training data (due to the complexity of curating experimental specimens or high-fidelity simulations) and environmental volatility of many industrial settings (resulting in the corruption of on-the-fly measurements). Thus, an end-to-end machine learning (ML) workflow for acoustic weld inspection in realistic (i.e., industrial) settings has remained an elusive goal. This work addresses the challenges of data curation and signal corruption by proposing workflow consisting of a reduced-order modeling scheme, diffusion based distribution alignment, and U-Net-based segmentation and inversion. A reduced-order Helmholtz model based on Lamb wave theory is used to generate a comprehensive dataset over varying weld heterogeneity and crack defects. The relatively inexpensive low-order solutions provide a robust training dateset for inversion models which are refined through a transfer learning stage using a limited set of full 3D elastodynamic simulations. To handle out-of-distribution (OOD) real-world measurements with varying and unpredictable noise distributions, i.e., Laser Doppler Vibrometry scans, guided diffusion produces in-distribution representations of OOD experimental LDV scans which are subsequently processed by the inversion models. This integrated framework provides an end-to-end solution for automated weld inspection on real data.
comment: 26 pages, 6 page appendix
♻ ☆ A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies
Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain underdiagnosed and poorly characterized due to their low prevalence and limited clinician familiarity. Computational phenotyping offers a scalable approach to improving rare disease detection, but algorithm development is hindered by the scarcity of high-quality labeled data for training. Expert-labeled datasets from chart reviews and registries are clinically accurate but limited in scope and availability, whereas labels derived from electronic health records (EHRs) provide broader coverage but are often noisy or incomplete. To address these challenges, we propose WEST (WEakly Supervised Transformer for rare disease phenotyping and subphenotyping from EHRs), a framework that combines routinely collected EHR data with a limited set of expert-validated cases and controls to enable large-scale phenotyping. At its core, WEST employs a weakly supervised transformer model trained on extensive probabilistic silver-standard labels - derived from both structured and unstructured EHR features - that are iteratively refined during training to improve model calibration. We evaluate WEST on two rare pulmonary diseases using EHR data from Boston Children's Hospital and show that it outperforms existing methods in phenotype classification, identification of clinically meaningful subphenotypes, and prediction of disease progression. By reducing reliance on manual annotation, WEST enables data-efficient rare disease phenotyping that improves cohort definition, supports earlier and more accurate diagnosis, and accelerates data-driven discovery for the rare disease community.
comment: 21 pages, 7 figures
♻ ☆ AI Guided Accelerator For Search Experience
Effective query reformulation is pivotal in narrowing the gap between a user's exploratory search behavior and the identification of relevant products in e-commerce environments. While traditional approaches predominantly model query rewrites as isolated pairs, they often fail to capture the sequential and transitional dynamics inherent in real-world user behavior. In this work, we propose a novel framework that explicitly models transitional queries--intermediate reformulations occurring during the user's journey toward their final purchase intent. By mining structured query trajectories from eBay's large-scale user interaction logs, we reconstruct query sequences that reflect shifts in intent while preserving semantic coherence. This approach allows us to model a user's shopping funnel, where mid-journey transitions reflect exploratory behavior and intent refinement. Furthermore, we incorporate generative Large Language Models (LLMs) to produce semantically diverse and intent-preserving alternative queries, extending beyond what can be derived through collaborative filtering alone. These reformulations can be leveraged to populate Related Searches or to power intent-clustered carousels on the search results page, enhancing both discovery and engagement. Our contributions include (i) the formal identification and modeling of transitional queries, (ii) the introduction of a structured query sequence mining pipeline for intent flow understanding, and (iii) the application of LLMs for scalable, intent-aware query expansion. Empirical evaluation demonstrates measurable gains in conversion and engagement metrics compared to the existing Related Searches module, validating the effectiveness of our approach in real-world e-commerce settings.
comment: Accepted at SIGIR eCom'25. https://sigir-ecom.github.io/eCom25Papers/paper_25.pdf
♻ ☆ Loss-Complexity Landscape and Model Structure Functions
We develop a framework for dualizing the Kolmogorov structure function $h_x(\alpha)$, which then allows using computable complexity proxies. We establish a mathematical analogy between information-theoretic constructs and statistical mechanics, introducing a suitable partition function and free energy functional. We explicitly prove the Legendre-Fenchel duality between the structure function and free energy, showing detailed balance of the Metropolis kernel, and interpret acceptance probabilities as information-theoretic scattering amplitudes. A susceptibility-like variance of model complexity is shown to peak precisely at loss-complexity trade-offs interpreted as phase transitions. Practical experiments with linear and tree-based regression models verify these theoretical predictions, explicitly demonstrating the interplay between the model complexity, generalization, and overfitting threshold.
comment: 25 pages, 11 figures; GitHub repository at https://github.com/sashakolpakov/structure-functions
♻ ☆ Optimally Deep Networks - Adapting Model Depth to Datasets for Superior Efficiency
Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.
comment: 6 pages, 3 figures, 1 table
♻ ☆ MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation
Clouds remain a critical challenge in optical satellite imagery, hindering reliable analysis for environmental monitoring, land cover mapping, and climate research. To overcome this, we propose MSCloudCAM, a Cross-Attention with Multi-Scale Context Network tailored for multispectral and multi-sensor cloud segmentation. Our framework exploits the spectral richness of Sentinel-2 (CloudSEN12) and Landsat-8 (L8Biome) data to classify four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow. MSCloudCAM combines a Swin Transformer backbone for hierarchical feature extraction with multi-scale context modules ASPP and PSP for enhanced scale-aware learning. A Cross-Attention block enables effective multisensor and multispectral feature fusion, while the integration of an Efficient Channel Attention Block (ECAB) and a Spatial Attention Module adaptively refine feature representations. Comprehensive experiments on CloudSEN12 and L8Biome demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy, surpassing leading baseline architectures while maintaining competitive parameter efficiency and FLOPs. These results underscore the model's effectiveness and practicality, making it well-suited for large-scale Earth observation tasks and real-world applications.
comment: 7 pages, 2 Figures
♻ ☆ Explainable Machine Learning for Oxygen Diffusion in Perovskites and Pyrochlores
Explainable machine learning can help to discover new physical relationships for material properties. To understand the material properties that govern the activation energy for oxygen diffusion in perovskites and pyrochlores, we build a database of experimental activation energies and apply a grouping algorithm to the material property features. These features are then used to fit seven different machine learning models. An ensemble consensus determines that the most important features for predicting the activation energy are the ionicity of the A-site bond and the partial pressure of oxygen for perovskites. For pyrochlores, the two most important features are the A-site $s$ valence electron count and the B-site electronegativity. The most important features are all constructed using the weighted averages of elemental metal properties, despite weighted averages of the constituent binary oxides being included in our feature set. This is surprising because the material properties of the constituent oxides are more similar to the experimentally measured properties of perovskites and pyrochlores than the features of the metals that are chosen. The easy-to-measure features identified in this work enable rapid screening for new materials with fast oxide-ion diffusivity.
comment: 35 pages, 11 figures
♻ ☆ METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by $1.64-2.54\times$ without sacrificing generation quality.
comment: 17 pages, 18 figures
♻ ☆ Photovoltaic power forecasting using quantum machine learning
Accurate forecasting of photovoltaic power is essential for reliable grid integration, yet remains difficult due to highly variable irradiance, complex meteorological drivers, site geography, and device-specific behavior. Although contemporary machine learning has achieved successes, it is not clear that these approaches are optimal: new model classes may further enhance performance and data efficiency. We investigate hybrid quantum neural networks for time-series forecasting of photovoltaic power and introduce two architectures. The first, a Hybrid Quantum Long Short-Term Memory model, reduces mean absolute error and mean squared error by more than 40% relative to the strongest baselines evaluated. The second, a Hybrid Quantum Sequence-to-Sequence model, once trained, it predicts power for arbitrary forecast horizons without requiring prior meteorological inputs and achieves a 16% lower mean absolute error than the best baseline on this task. Both hybrid models maintain superior accuracy when training data are limited, indicating improved data efficiency. These results show that hybrid quantum models address key challenges in photovoltaic power forecasting and offer a practical route to more reliable, data-efficient energy predictions.
comment: 18 pages, 7 figures, 4 tables
♻ ☆ Variational Autoencoders for Efficient Simulation-Based Inference
We present a generative modeling approach based on the variational inference framework for likelihood-free simulation-based inference. The method leverages latent variables within variational autoencoders to efficiently estimate complex posterior distributions arising from stochastic simulations. We explore two variations of this approach distinguished by their treatment of the prior distribution. The first model adapts the prior based on observed data using a multivariate prior network, enhancing generalization across various posterior queries. In contrast, the second model utilizes a standard Gaussian prior, offering simplicity while still effectively capturing complex posterior distributions. We demonstrate the ability of the proposed approach to approximate complex posteriors while maintaining computational efficiency on well-established benchmark problems.
♻ ☆ Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors
Prior-data fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular data sets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled and efficient sampling procedure to construct Bayesian posteriors for such estimates based on Martingale posteriors, and prove its convergence. Several simulated and real-world data examples showcase the uncertainty quantification of our method in inference applications.
♻ ☆ Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models
Self-supervised word embedding algorithms such as word2vec provide a minimal setting for studying representation learning in language modeling. We examine the quartic Taylor approximation of the word2vec loss around the origin, and we show that both the resulting training dynamics and the final performance on downstream tasks are empirically very similar to those of word2vec. Our main contribution is to analytically solve for both the gradient flow training dynamics and the final word embeddings in terms of only the corpus statistics and training hyperparameters. The solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on Wikipedia, we find that each of the top linear subspaces represents an interpretable topic-level concept. Finally, we apply our theory to describe how linear representations of more abstract semantic concepts emerge during training; these can be used to complete analogies via vector addition.
comment: 26 pages, 8 figures
♻ ☆ Rewind-to-Delete: Certified Machine Unlearning for Nonconvex Functions NeurIPS 2025
Machine unlearning algorithms aim to efficiently remove data from a model without retraining it from scratch, in order to remove corrupted or outdated data or respect a user's ``right to be forgotten." Certified machine unlearning is a strong theoretical guarantee based on differential privacy that quantifies the extent to which an algorithm erases data from the model weights. In contrast to existing works in certified unlearning for convex or strongly convex loss functions, or nonconvex objectives with limiting assumptions, we propose the first, first-order, black-box (i.e., can be applied to models pretrained with vanilla gradient descent) algorithm for unlearning on general nonconvex loss functions, which unlearns by ``rewinding" to an earlier step during the learning process before performing gradient descent on the loss function of the retained data points. We prove $(\epsilon, \delta)$ certified unlearning and performance guarantees that establish the privacy-utility-complexity tradeoff of our algorithm, and we prove generalization guarantees for functions that satisfy the Polyak-Lojasiewicz inequality. Finally, we demonstrate the superior performance of our algorithm compared to existing methods, within a new experimental framework that more accurately reflects unlearning user data in practice.
comment: NeurIPS 2025
♻ ☆ Evaluating Sparse Autoencoders for Monosemantic Representation
A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through activation distribution lens. We introduce a fine-grained concept separability score based on the Jensen-Shannon distance, which captures how distinctly a neuron's activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (including word-level and sentence-level), we show that SAEs reduce polysemanticity and achieve higher concept separability. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP achieves the smallest perplexity increase while remaining highly effective at concept removal.
♻ ☆ Graph-based Neural Space Weather Forecasting NeurIPS 2025
Accurate space weather forecasting is crucial for protecting our increasingly digital infrastructure. Hybrid-Vlasov models, like Vlasiator, offer physical realism beyond that of current operational systems, but are too computationally expensive for real-time use. We introduce a graph-based neural emulator trained on Vlasiator data to autoregressively predict near-Earth space conditions driven by an upstream solar wind. We show how to achieve both fast deterministic forecasts and, by using a generative model, produce ensembles to capture forecast uncertainty. This work demonstrates that machine learning offers a way to add uncertainty quantification capability to existing space weather prediction systems, and make hybrid-Vlasov simulation tractable for operational use.
comment: 20 pages, 18 figures. Accepted to the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences
♻ ☆ GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping
Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose \textit{GradES}, a novel gradient-based early stopping approach that operates within transformer components (attention projections and Feed-Forward layer matrices). We found that different components converge at varying rates during fine-tuning for both language and vision-language models. \textit{GradES} tracks the magnitude of gradient changes in backpropagation for these matrices during training. When a projection matrix's magnitude of gradient changes fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slow converging matrices to continue learning. \textit{GradES} speeds up training time by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2\% higher average accuracy in language tasks and 3.88\% on multimodal benchmarks.
comment: 20 pages, 5 figures
♻ ☆ Hydra: A Modular Architecture for Efficient Long-Context Reasoning NeurIPS
The quadratic complexity of transformers fundamentally limits reasoning system deployment in resource-constrained and long-context settings. We introduce Hydra, a modular architecture based upon a state-space backbone which adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and product key memory. We evaluate a 29M parameter model measuring logical chaining accuracy and throughput on synthetic sequences, plus throughput on WikiText. Ablation studies use component-specific synthetic datasets to isolate individual mechanisms. Hydra achieves $3.01\times$ and $3.0\times$ throughput gains at 8K tokens for synthetic and WikiText datasets, respectively, and $10\times$ accuracy improvements on multi-step logical composition compared to equal-sized transformers. Ablations confirm each component's contribution: sparse attention captures long-range dependencies, experts specialize to input domains, and product key memory enables selective retrieval.
comment: Updated with the new paper accepted to NeurIPS workshop
♻ ☆ Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic B\'ezier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify B\'ezier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
comment: 20 pages, 4 figures
♻ ☆ End-to-End Learning Framework for Solving Non-Markovian Optimal Control
Integer-order calculus often falls short in capturing the long-range dependencies and memory effects found in many real-world processes. Fractional calculus addresses these gaps via fractional-order integrals and derivatives, but fractional-order dynamical systems pose substantial challenges in system identification and optimal control due to the lack of standard control methodologies. In this paper, we theoretically derive the optimal control via linear quadratic regulator (LQR) for fractional-order linear time-invariant (FOLTI) systems and develop an end-to-end deep learning framework based on this theoretical foundation. Our approach establishes a rigorous mathematical model, derives analytical solutions, and incorporates deep learning to achieve data-driven optimal control of FOLTI systems. Our key contributions include: (i) proposing an innovative system identification method control strategy for FOLTI systems, (ii) developing the first end-to-end data-driven learning framework, Fractional-Order Learning for Optimal Control (FOLOC), that learns control policies from observed trajectories, and (iii) deriving a theoretical analysis of sample complexity to quantify the number of samples required for accurate optimal control in complex real-world problems. Experimental results indicate that our method accurately approximates fractional-order system behaviors without relying on Gaussian noise assumptions, pointing to promising avenues for advanced optimal control.
♻ ☆ Uncertainty Quantification for Physics-Informed Neural Networks with Extended Fiducial Inference
Uncertainty quantification (UQ) in scientific machine learning is increasingly critical as neural networks are widely adopted to tackle complex problems across diverse scientific disciplines. For physics-informed neural networks (PINNs), a prominent model in scientific machine learning, uncertainty is typically quantified using Bayesian or dropout methods. However, both approaches suffer from a fundamental limitation: the prior distribution or dropout rate required to construct honest confidence sets cannot be determined without additional information. In this paper, we propose a novel method within the framework of extended fiducial inference (EFI) to provide rigorous uncertainty quantification for PINNs. The proposed method leverages a narrow-neck hyper-network to learn the parameters of the PINN and quantify their uncertainty based on imputed random errors in the observations. This approach overcomes the limitations of Bayesian and dropout methods, enabling the construction of honest confidence sets based solely on observed data. This advancement represents a significant breakthrough for PINNs, greatly enhancing their reliability, interpretability, and applicability to real-world scientific and engineering challenges. Moreover, it establishes a new theoretical framework for EFI, extending its application to large-scale models, eliminating the need for sparse hyper-networks, and significantly improving the automaticity and robustness of statistical inference.
♻ ☆ On the Interaction of Compressibility and Adversarial Robustness
Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
♻ ☆ Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Concept Manipulation CALM, an inference-time method that suppresses harmful concepts by modifying latent representations of the last layer of the model, without retraining. Leveraging concept whitening technique from Computer Vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods in most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
♻ ☆ Autonomous Cyber Resilience via a Co-Evolutionary Arms Race within a Fortified Digital Twin Sandbox
The convergence of Information Technology and Operational Technology has exposed Industrial Control Systems to adaptive, intelligent adversaries that render static defenses obsolete. This paper introduces the Adversarial Resilience Co-evolution (ARC) framework, addressing the "Trinity of Trust" comprising model fidelity, data integrity, and analytical resilience. ARC establishes a co-evolutionary arms race within a Fortified Secure Digital Twin (F-SCDT), where a Deep Reinforcement Learning "Red Agent" autonomously discovers attack paths while an ensemble-based "Blue Agent" is continuously hardened against these threats. Experimental validation on the Tennessee Eastman Process (TEP) and Secure Water Treatment (SWaT) testbeds demonstrates superior performance in detecting novel attacks, with F1-scores improving from 0.65 to 0.89 and detection latency reduced from over 1200 seconds to 210 seconds. A comprehensive ablation study reveals that the co-evolutionary process itself contributes a 27% performance improvement. By integrating Explainable AI and proposing a Federated ARC architecture, this work presents a necessary paradigm shift toward dynamic, self-improving security for critical infrastructure.
comment: 6 pages, 2 figures, 4 equations, 1 algorithm, 3 tables, to be published in ISPACS 2025, unabridged version exists as arXiv:2506.20102v1
Computer Vision and Pattern Recognition 185
☆ PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning
Video generation models nowadays are capable of generating visually realistic videos, but often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and serve as ''world models''. To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, PhysMaster is based on the image-to-video task where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors like relative positions and potential interactions of objects in the scenario, we devise PhysEncoder to encode physical information from it as an extra condition to inject physical knowledge into the video generation process. The lack of proper supervision on the model's physical performance beyond mere appearance motivates PhysEncoder to apply reinforcement learning with human feedback to physical representation learning, which leverages feedback from generation models to optimize physical representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving physics-awareness of PhysEncoder and thus of video generation, proving its ability on a simple proxy task and generalizability to wide-ranging physical scenarios. This implies that our PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic and plug-in solution for physics-aware video generation and broader applications.
comment: Project Page: https://sihuiji.github.io/PhysMaster-Page/
☆ VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models
Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings-cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.
☆ Generative Universal Verifier as Multimodal Meta-Reasoner
We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification and achieves notable gains on ViVerBench(+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench(+3.7), and GenEval++(+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
☆ Trace Anything: Representing Any Video in 4D via Trajectory Fields
Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion. Project page: https://trace-anything.github.io/.
☆ Reasoning in Space via Grounding in the World
In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
comment: 20 pages, 7 figures
☆ The Mechanistic Emergence of Symbol Grounding in Language Models
Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.
☆ Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
comment: homepage: https://open-bee.github.io/
☆ NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models
With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose , a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.
comment: code available at: https://github.com/nirgoren/NoisePrints
☆ Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation
Story continuation focuses on generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting lies in utilizing prior visual context effectively, while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs the CLIP model to retrieve the most semantically aligned image from previous frames. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to only the early stages of the diffusion process. This enables the model to exploit visual context when beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset using large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.
☆ InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine ``where to act'' by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide ``how to act'' by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.
comment: Technical report
☆ UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations
Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at https://github.com/DominikM198/UrbanFusion.
☆ Scaling Vision Transformers for Functional MRI with Flat Maps NeurIPS 2025
A key question for adapting modern deep learning architectures to functional MRI (fMRI) is how to represent the data for model input. To bridge the modality gap between fMRI and natural images, we transform the 4D volumetric fMRI data into videos of 2D fMRI activity flat maps. We train Vision Transformers on 2.3K hours of fMRI flat map videos from the Human Connectome Project using the spatiotemporal masked autoencoder (MAE) framework. We observe that masked fMRI modeling performance improves with dataset size according to a strict power scaling law. Downstream classification benchmarks show that our model learns rich representations supporting both fine-grained state decoding across subjects, as well as subject-specific trait decoding across changes in brain state. This work is part of an ongoing open science project to build foundation models for fMRI data. Our code and datasets are available at https://github.com/MedARC-AI/fmri-fm.
comment: NeurIPS 2025 Workshop, Foundation Models for the Brain and Body; Code: https://github.com/MedARC-AI/fmri-fm; Discord: https://discord.gg/tVR4TWnRM9
☆ Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, demanding models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.
comment: Equal contributions from frst three authors. Project page: https://vchitect.github.io/Uni-MMMU-Project/ Code: https://github.com/vchitect/Uni-MMMU
☆ RECODE: Reasoning Through Code Generation for Visual Question Answering
Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to the much larger model like Qwen2.5-Omni-7B on general benchmarks, and it can retain 97% of the performance of the InteractiveOmni-8B while utilizing only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.
☆ UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy
Computational replication of Chinese calligraphy remains challenging. Existing methods falter, either creating high-quality isolated characters while ignoring page-level aesthetics like ligatures and spacing, or attempting page synthesis at the expense of calligraphic correctness. We introduce \textbf{UniCalli}, a unified diffusion framework for column-level recognition and generation. Training both tasks jointly is deliberate: recognition constrains the generator to preserve character structure, while generation provides style and layout priors. This synergy fosters concept-level abstractions that improve both tasks, especially in limited-data regimes. We curated a dataset of over 8,000 digitized pieces, with ~4,000 densely annotated. UniCalli employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data. The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition. The framework successfully extends to other ancient scripts, including Oracle bone inscriptions and Egyptian hieroglyphs. Code and data can be viewed in \href{https://github.com/EnVision-Research/UniCalli}{this URL}.
comment: 22 pages
☆ Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs
Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over-squashing and missing multiple connections to gain the same information that could be gained from a long-range link. Through this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC) to enhance performance by limiting the number of long-range links. To this end, we propose LogViG, a novel hybrid CNN-GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce and apply a high-resolution branch and fuse features between our high-resolution and low-resolution branches for a multi-scale high-resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at https://github.com/mmunir127/LogViG-Official.
comment: Published in the Proceedings of the Third Learning on Graphs Conference (LoG 2024)
☆ Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis
Synthesizing high-quality images from low-field MRI holds significant potential. Low-field MRI is cheaper, more accessible, and safer, but suffers from low resolution and poor signal-to-noise ratio. This synthesis process can reduce reliance on costly acquisitions and expand data availability. However, synthesizing high-field MRI still suffers from a clinical fidelity gap. There is a need to preserve anatomical fidelity, enhance fine-grained structural details, and bridge domain gaps in image contrast. To address these issues, we propose a \emph{cyclic self-supervised diffusion (CSS-Diff)} framework for high-field MRI synthesis from real low-field MRI data. Our core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint. It enforces anatomical preservation throughout the generative process rather than just relying on paired pixel-level supervision. The CSS-Diff framework further incorporates two novel processes. The slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning. The local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Extensive experiments on cross-field synthesis tasks demonstrate the effectiveness of our method, achieving state-of-the-art performance (e.g., 31.80 $\pm$ 2.70 dB in PSNR, 0.943 $\pm$ 0.102 in SSIM, and 0.0864 $\pm$ 0.0689 in LPIPS). Beyond pixel-wise fidelity, our method also preserves fine-grained anatomical structures compared with the original low-field MRI (e.g., left cerebral white matter error drops from 12.1$\%$ to 2.1$\%$, cortex from 4.2$\%$ to 3.7$\%$). To conclude, our CSS-Diff can synthesize images that are both quantitatively reliable and anatomically consistent.
☆ LiFMCR: Dataset and Benchmark for Light Field Multi-Camera Registration
We present LiFMCR, a novel dataset for the registration of multiple micro lens array (MLA)-based light field cameras. While existing light field datasets are limited to single-camera setups and typically lack external ground truth, LiFMCR provides synchronized image sequences from two high-resolution Raytrix R32 plenoptic cameras, together with high-precision 6-degrees of freedom (DoF) poses recorded by a Vicon motion capture system. This unique combination enables rigorous evaluation of multi-camera light field registration methods. As a baseline, we provide two complementary registration approaches: a robust 3D transformation estimation via a RANSAC-based method using cross-view point clouds, and a plenoptic PnP algorithm estimating extrinsic 6-DoF poses from single light field images. Both explicitly integrate the plenoptic camera model, enabling accurate and scalable multi-camera registration. Experiments show strong alignment with the ground truth, supporting reliable multi-view light field processing. Project page: https://lifmcr.github.io/
comment: Accepted at the International Symposium on Visual Computing (ISVC) 2025
☆ NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval.In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
☆ Circle of Willis Centerline Graphs: A Dataset and Baseline Algorithm
The Circle of Willis (CoW) is a critical network of arteries in the brain, often implicated in cerebrovascular pathologies. Voxel-level segmentation is an important first step toward an automated CoW assessment, but a full quantitative analysis requires centerline representations. However, conventional skeletonization techniques often struggle to extract reliable centerlines due to the CoW's complex geometry, and publicly available centerline datasets remain scarce. To address these challenges, we used a thinning-based skeletonization algorithm to extract and curate centerline graphs and morphometric features from the TopCoW dataset, which includes 200 stroke patients, each imaged with MRA and CTA. The curated graphs were used to develop a baseline algorithm for centerline and feature extraction, combining U-Net-based skeletonization with A* graph connection. Performance was evaluated on a held-out test set, focusing on anatomical accuracy and feature robustness. Further, we used the extracted features to predict the frequency of fetal PCA variants, confirm theoretical bifurcation optimality relations, and detect subtle modality differences. The baseline algorithm consistently reconstructed graph topology with high accuracy (F1 = 1), and the average Euclidean node distance between reference and predicted graphs was below one voxel. Features such as segment radius, length, and bifurcation ratios showed strong robustness, with median relative errors below 5% and Pearson correlations above 0.95. Our results demonstrate the utility of learning-based skeletonization combined with graph connection for anatomically plausible centerline extraction. We emphasize the importance of going beyond simple voxel-based measures by evaluating anatomical accuracy and feature robustness. The dataset and baseline algorithm have been released to support further method development and clinical research.
☆ Dedelayed: Deleting remote inference delay via on-device correction
Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.
☆ MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.
comment: Project page: https://minjung-s.github.io/mvcustom
☆ Risk-adaptive Activation Steering for Safe Multimodal Large Language Models
One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets at the significant costs in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that the RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.
☆ Generating healthy counterfactuals with denoising diffusion bridge models
Generating healthy counterfactuals from pathological images holds significant promise in medical imaging, e.g., in anomaly detection or for application of analysis tools that are designed for healthy scans. These counterfactuals should represent what a patient's scan would plausibly look like in the absence of pathology, preserving individual anatomical characteristics while modifying only the pathological regions. Denoising diffusion probabilistic models (DDPMs) have become popular methods for generating healthy counterfactuals of pathology data. Typically, this involves training on solely healthy data with the assumption that a partial denoising process will be unable to model disease regions and will instead reconstruct a closely matched healthy counterpart. More recent methods have incorporated synthetic pathological images to better guide the diffusion process. However, it remains challenging to guide the generative process in a way that effectively balances the removal of anomalies with the retention of subject-specific features. To solve this problem, we propose a novel application of denoising diffusion bridge models (DDBMs) - which, unlike DDPMs, condition the diffusion process not only on the initial point (i.e., the healthy image), but also on the final point (i.e., a corresponding synthetically generated pathological image). Treating the pathological image as a structurally informative prior enables us to generate counterfactuals that closely match the patient's anatomy while selectively removing pathology. The results show that our DDBM outperforms previously proposed diffusion models and fully supervised approaches at segmentation and anomaly detection tasks.
FlashWorld: High-quality 3D Scene Generation within Seconds
We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100$\times$ faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented method typically suffers poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.
comment: Project Page: https://imlixinyang.github.io/FlashWorld-Project-Page/
☆ Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning
Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. In this work, we propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.
☆ NTIRE 2025 Challenge on Low Light Image Enhancement: Methods and Results CVPR
This paper presents a comprehensive review of the NTIRE 2025 Low-Light Image Enhancement (LLIE) Challenge, highlighting the proposed solutions and final outcomes. The objective of the challenge is to identify effective networks capable of producing brighter, clearer, and visually compelling images under diverse and challenging conditions. A remarkable total of 762 participants registered for the competition, with 28 teams ultimately submitting valid entries. This paper thoroughly evaluates the state-of-the-art advancements in LLIE, showcasing the significant progress.
comment: CVPR NTIRE 2025 Workshop, please refer to https://openaccess.thecvf.com/CVPR2025_workshops/NTIRE
☆ CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas
Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizer. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism--a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on Kinetics-600 dataset and rivals diffusion-based methods.
☆ OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild NeurIPS 2025
Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
comment: Accepted to NeurIPS 2025; Project page: \url{https://github.com/quhongyu/OmniGaze}
☆ Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection
Foundation models such as DINOv2 have shown strong performance in few-shot anomaly detection, yet two key questions remain unexamined: (i) how susceptible are these detectors to adversarial perturbations; and (ii) how well do their anomaly scores reflect calibrated uncertainty? Building on AnomalyDINO, a training-free deep nearest-neighbor detector over DINOv2 features, we present one of the first systematic studies of adversarial attacks and uncertainty estimation in this setting. To enable white-box gradient attacks while preserving test-time behavior, we attach a lightweight linear head to frozen DINOv2 features only for crafting perturbations. Using this heuristic, we evaluate the impact of FGSM across the MVTec-AD and VisA datasets and observe consistent drops in F1, AUROC, AP, and G-mean, indicating that imperceptible perturbations can flip nearest-neighbor relations in feature space to induce confident misclassification. Complementing robustness, we probe reliability and find that raw anomaly scores are poorly calibrated, revealing a gap between confidence and correctness that limits safety-critical use. As a simple, strong baseline toward trustworthiness, we apply post-hoc Platt scaling to the anomaly scores for uncertainty estimation. The resulting calibrated posteriors yield significantly higher predictive entropy on adversarially perturbed inputs than on clean ones, enabling a practical flagging mechanism for attack detection while reducing calibration error (ECE). Our findings surface concrete vulnerabilities in DINOv2-based few-shot anomaly detectors and establish an evaluation protocol and baseline for robust, uncertainty-aware anomaly detection. We argue that adversarial robustness and principled uncertainty quantification are not optional add-ons but essential capabilities if anomaly detection systems are to be trustworthy and ready for real-world deployment.
comment: 10 pages, 5 figures, 3 tables
☆ Challenges, Advances, and Evaluation Metrics in Medical Image Enhancement: A Systematic Literature Review
Medical image enhancement is crucial for improving the quality and interpretability of diagnostic images, ultimately supporting early detection, accurate diagnosis, and effective treatment planning. Despite advancements in imaging technologies such as X-ray, CT, MRI, and ultrasound, medical images often suffer from challenges like noise, artifacts, and low contrast, which limit their diagnostic potential. Addressing these challenges requires robust preprocessing, denoising algorithms, and advanced enhancement methods, with deep learning techniques playing an increasingly significant role. This systematic literature review, following the PRISMA approach, investigates the key challenges, recent advancements, and evaluation metrics in medical image enhancement. By analyzing findings from 39 peer-reviewed studies, this review provides insights into the effectiveness of various enhancement methods across different imaging modalities and the importance of evaluation metrics in assessing their impact. Key issues like low contrast and noise are identified as the most frequent, with MRI and multi-modal imaging receiving the most attention, while specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored. Out of the 39 studies, 29 utilize conventional mathematical methods, 9 focus on deep learning techniques, and 1 explores a hybrid approach. In terms of image quality assessment, 18 studies employ both reference-based and non-reference-based metrics, 9 rely solely on reference-based metrics, and 12 use only non-reference-based metrics, with a total of 65 IQA metrics introduced, predominantly non-reference-based. This review highlights current limitations, research gaps, and potential future directions for advancing medical image enhancement.
☆ AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset
Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) model that learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset, is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.
☆ LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
☆ Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues
Unmanned aerial vehicles (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality dataset. However, the existing dataset struggles to fully capture real-world complexity for limited imaging conditions. To this end, we introduce a high-diversity dataset ATR-UMOD covering varying scenarios, spanning altitudes from 80m to 300m, angles from 0{\deg} to 75{\deg}, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures the availability in practice without condition annotations. Experiments on ATR-UMOD dataset reveal the effectiveness of PCDF.
☆ XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation
Depth estimation remains central to autonomous driving, and radar-camera fusion offers robustness in adverse conditions by providing complementary geometric cues. In this paper, we present XD-RCDepth, a lightweight architecture that reduces the parameters by 29.7% relative to the state-of-the-art lightweight baseline while maintaining comparable accuracy. To preserve performance under compression and enhance interpretability, we introduce two knowledge-distillation strategies: an explainability-aligned distillation that transfers the teacher's saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. Together, these components reduce the MAE compared with direct training with 7.97% and deliver competitive accuracy with real-time efficiency on nuScenes and ZJU-4DRadarCam datasets.
comment: Submitted to ICASSP 2026
☆ An efficient approach with theoretical guarantees to simultaneously reconstruct activity and attenuation sinogram for TOF-PET
In positron emission tomography (PET), it is indispensable to perform attenuation correction in order to obtain the quantitatively accurate activity map (tracer distribution) in the body. Generally, this is carried out based on the estimated attenuation map obtained from computed tomography or magnetic resonance imaging. However, except for errors in the attenuation correction factors obtained, the additional scan not only brings in new radiation doses and/or increases the scanning time but also leads to severe misalignment induced by various motions during and between the two sequential scans. To address these issues, based on maximum likelihood estimation, we propose a new mathematical model for simultaneously reconstructing the activity and attenuation sinogram from the time-of-flight (TOF)-PET emission data only. Particularly, we make full use of the exclusively exponential form for the attenuation correction factors, and consider the constraint of a total amount of the activity in some mask region in the proposed model. Furthermore, we prove its well-posedness, including the existence, uniqueness and stability of the solution. We propose an alternating update algorithm to solve the model, and also analyze its convergence. Finally, numerical experiments with various TOF-PET emission data demonstrate that the proposed method is of numerical convergence and robust to noise, and outperforms some state-of-the-art methods in terms of accuracy and efficiency, and has the capability of autonomous attenuation correction.
comment: 32 pages, 11 figures, 4 tables
☆ Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents
Facial expression recognition (FER) must remain robust under both cultural variation and perceptually degraded visual conditions, yet most existing evaluations assume homogeneous data and high-quality imagery. We introduce an agent-based, streaming benchmark that reveals how cross-cultural composition and progressive blurring interact to shape face recognition robustness. Each agent operates in a frozen CLIP feature space with a lightweight residual adapter trained online at sigma=0 and fixed during testing. Agents move and interact on a 5x5 lattice, while the environment provides inputs with sigma-scheduled Gaussian blur. We examine monocultural populations (Western-only, Asian-only) and mixed environments with balanced (5/5) and imbalanced (8/2, 2/8) compositions, as well as different spatial contact structures. Results show clear asymmetric degradation curves between cultural groups: JAFFE (Asian) populations maintain higher performance at low blur but exhibit sharper drops at intermediate stages, whereas KDEF (Western) populations degrade more uniformly. Mixed populations exhibit intermediate patterns, with balanced mixtures mitigating early degradation, but imbalanced settings amplify majority-group weaknesses under high blur. These findings quantify how cultural composition and interaction structure influence the robustness of FER as perceptual conditions deteriorate.
comment: Accepted for presentation at the International Symposium on Agentic Artificial Intelligence Systems (AAIS 2025)
☆ Accelerated Feature Detectors for Visual SLAM: A Comparative Study of FPGA vs GPU
Feature detection is a common yet time-consuming module in Simultaneous Localization and Mapping (SLAM) implementations, which are increasingly deployed on power-constrained platforms, such as drones. Graphics Processing Units (GPUs) have been a popular accelerator for computer vision in general, and feature detection and SLAM in particular. On the other hand, System-on-Chips (SoCs) with integrated Field Programmable Gate Array (FPGA) are also widely available. This paper presents the first study of hardware-accelerated feature detectors considering a Visual SLAM (V-SLAM) pipeline. We offer new insights by comparing the best GPU-accelerated FAST, Harris, and SuperPoint implementations against the FPGA-accelerated counterparts on modern SoCs (Nvidia Jetson Orin and AMD Versal). The evaluation shows that when using a non-learning-based feature detector such as FAST and Harris, their GPU implementations, and the GPU-accelerated V-SLAM can achieve better run-time performance and energy efficiency than the FAST and Harris FPGA implementations as well as the FPGA-accelerated V-SLAM. However, when considering a learning-based detector such as SuperPoint, its FPGA implementation can achieve better run-time performance and energy efficiency (up to 3.1$\times$ and 1.4$\times$ improvements, respectively) than the GPU implementation. The FPGA-accelerated V-SLAM can also achieve comparable run-time performance compared to the GPU-accelerated V-SLAM, with better FPS in 2 out of 5 dataset sequences. When considering the accuracy, the results show that the GPU-accelerated V-SLAM is more accurate than the FPGA-accelerated V-SLAM in general. Last but not least, the use of hardware acceleration for feature detection could further improve the performance of the V-SLAM pipeline by having the global bundle adjustment module invoked less frequently without sacrificing accuracy.
comment: 12 pages, 7 figures
☆ Learning Neural Parametric 3D Breast Shape Models for Metrical Surface Reconstruction From Monocular RGB Videos
We present a neural parametric 3D breast shape model and, based on this model, introduce a low-cost and accessible 3D surface reconstruction pipeline capable of recovering accurate breast geometry from a monocular RGB video. In contrast to widely used, commercially available yet prohibitively expensive 3D breast scanning solutions and existing low-cost alternatives, our method requires neither specialized hardware nor proprietary software and can be used with any device that is able to record RGB videos. The key building blocks of our pipeline are a state-of-the-art, off-the-shelf Structure-from-motion pipeline, paired with a parametric breast model for robust and metrically correct surface reconstruction. Our model, similarly to the recently proposed implicit Regensburg Breast Shape Model (iRBSM), leverages implicit neural representations to model breast shapes. However, unlike the iRBSM, which employs a single global neural signed distance function (SDF), our approach -- inspired by recent state-of-the-art face models -- decomposes the implicit breast domain into multiple smaller regions, each represented by a local neural SDF anchored at anatomical landmark positions. When incorporated into our surface reconstruction pipeline, the proposed model, dubbed liRBSM (short for localized iRBSM), significantly outperforms the iRBSM in terms of reconstruction quality, yielding more detailed surface reconstruction than its global counterpart. Overall, we find that the introduced pipeline is able to recover high-quality 3D breast geometry within an error margin of less than 2 mm. Our method is fast (requires less than six minutes), fully transparent and open-source, and -- together with the model -- publicly available at https://rbsm.re-mic.de/local-implicit.
comment: 18 pages, 12 figures
☆ High Semantic Features for the Continual Learning of Complex Emotions: a Lightweight Solution
Incremental learning is a complex process due to potential catastrophic forgetting of old tasks when learning new ones. This is mainly due to transient features that do not fit from task to task. In this paper, we focus on complex emotion recognition. First, we learn basic emotions and then, incrementally, like humans, complex emotions. We show that Action Units, describing facial muscle movements, are non-transient, highly semantical features that outperform those extracted by both shallow and deep convolutional neural networks. Thanks to this ability, our approach achieves interesting results when learning incrementally complex, compound emotions with an accuracy of 0.75 on the CFEE dataset and can be favorably compared to state-of-the-art results. Moreover, it results in a lightweight model with a small memory footprint.
comment: 10 pages, 14 figures
☆ UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
comment: 12 pages, 6 figures, 11 tables
☆ ExpressNet-MoE: A Hybrid Deep Neural Network for Emotion Recognition
In many domains, including online education, healthcare, security, and human-computer interaction, facial emotion recognition (FER) is essential. Real-world FER is still difficult despite its significance because of some factors such as variable head positions, occlusions, illumination shifts, and demographic diversity. Engagement detection, which is essential for applications like virtual learning and customer services, is frequently challenging due to FER limitations by many current models. In this article, we propose ExpressNet-MoE, a novel hybrid deep learning model that blends both Convolution Neural Networks (CNNs) and Mixture of Experts (MoE) framework, to overcome the difficulties. Our model dynamically chooses the most pertinent expert networks, thus it aids in the generalization and providing flexibility to model across a wide variety of datasets. Our model improves on the accuracy of emotion recognition by utilizing multi-scale feature extraction to collect both global and local facial features. ExpressNet-MoE includes numerous CNN-based feature extractors, a MoE module for adaptive feature selection, and finally a residual network backbone for deep feature learning. To demonstrate efficacy of our proposed model we evaluated on several datasets, and compared with current state-of-the-art methods. Our model achieves accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013. The results show how adaptive our model is and how it may be used to develop end-to-end emotion recognition systems in practical settings. Reproducible codes and results are made publicly accessible at https://github.com/DeeptimaanB/ExpressNet-MoE.
comment: * Current version of the manuscript contains 17 pages including text, 13 figures, and 4 tables. The manuscript is currently under review at a journal
☆ Through the Lens of Doubt: Robust and Efficient Uncertainty Estimation for Visual Place Recognition
Visual Place Recognition (VPR) enables robots and autonomous vehicles to identify previously visited locations by matching current observations against a database of known places. However, VPR systems face significant challenges when deployed across varying visual environments, lighting conditions, seasonal changes, and viewpoints changes. Failure-critical VPR applications, such as loop closure detection in simultaneous localization and mapping (SLAM) pipelines, require robust estimation of place matching uncertainty. We propose three training-free uncertainty metrics that estimate prediction confidence by analyzing inherent statistical patterns in similarity scores from any existing VPR method. Similarity Distribution (SD) quantifies match distinctiveness by measuring score separation between candidates; Ratio Spread (RS) evaluates competitive ambiguity among top-scoring locations; and Statistical Uncertainty (SU) is a combination of SD and RS that provides a unified metric that generalizes across datasets and VPR methods without requiring validation data to select the optimal metric. All three metrics operate without additional model training, architectural modifications, or computationally expensive geometric verification. Comprehensive evaluation across nine state-of-the-art VPR methods and six benchmark datasets confirms that our metrics excel at discriminating between correct and incorrect VPR matches, and consistently outperform existing approaches while maintaining negligible computational overhead, making it deployable for real-time robotic applications across varied environmental conditions with improved precision-recall performance.
☆ VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
comment: Project page: https://gohyojun15.github.io/VIST3A/
☆ Near-Infrared Hyperspectral Imaging Applications in Food Analysis -- Improving Algorithms and Methodologies
This thesis investigates the application of near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis. The investigation is conducted through four studies operating with five research hypotheses. For several analyses, the studies compare models based on convolutional neural networks (CNNs) and partial least squares (PLS). Generally, joint spatio-spectral analysis with CNNs outperforms spatial analysis with CNNs and spectral analysis with PLS when modeling parameters where chemical and physical visual information are relevant. When modeling chemical parameters with a 2-dimensional (2D) CNN, augmenting the CNN with an initial layer dedicated to performing spectral convolution enhances its predictive performance by learning a spectral preprocessing similar to that applied by domain experts. Still, PLS-based spectral modeling performs equally well for analysis of the mean content of chemical parameters in samples and is the recommended approach. Modeling the spatial distribution of chemical parameters with NIR-HSI is limited by the ability to obtain spatially resolved reference values. Therefore, a study used bulk mean references for chemical map generation of fat content in pork bellies. A PLS-based approach gave non-smooth chemical maps and pixel-wise predictions outside the range of 0-100\%. Conversely, a 2D CNN augmented with a spectral convolution layer mitigated all issues arising with PLS. The final study attempted to model barley's germinative capacity by analyzing NIR spectra, RGB images, and NIR-HSI images. However, the results were inconclusive due to the dataset's low degree of germination. Additionally, this thesis has led to the development of two open-sourced Python packages. The first facilitates fast PLS-based modeling, while the second facilitates very fast cross-validation of PLS and other classical machine learning models with a new algorithm.
comment: PhD thesis
☆ Steerable Conditional Diffusion for Domain Adaptation in PET Image Reconstruction
Diffusion models have recently enabled state-of-the-art reconstruction of positron emission tomography (PET) images while requiring only image training data. However, domain shift remains a key concern for clinical adoption: priors trained on images from one anatomy, acquisition protocol or pathology may produce artefacts on out-of-distribution data. We propose integrating steerable conditional diffusion (SCD) with our previously-introduced likelihood-scheduled diffusion (PET-LiSch) framework to improve the alignment of the diffusion model's prior to the target subject. At reconstruction time, for each diffusion step, we use low-rank adaptation (LoRA) to align the diffusion model prior with the target domain on the fly. Experiments on realistic synthetic 2D brain phantoms demonstrate that our approach suppresses hallucinated artefacts under domain shift, i.e. when our diffusion model is trained on perturbed images and tested on normal anatomy, our approach suppresses the hallucinated structure, outperforming both OSEM and diffusion model baselines qualitatively and quantitatively. These results provide a proof-of-concept that steerable priors can mitigate domain shift in diffusion-based PET reconstruction and motivate future evaluation on real data.
comment: Accepted for oral presentation at IEEE NSS MIC RTSD 2025 (submitted May 2025; accepted July 2025; to be presented Nov 2025)
☆ Beyond Pixels: A Differentiable Pipeline for Probing Neuronal Selectivity in 3D
Visual perception relies on inference of 3D scene properties such as shape, pose, and lighting. To understand how visual sensory neurons enable robust perception, it is crucial to characterize their selectivity to such physically interpretable factors. However, current approaches mainly operate on 2D pixels, making it difficult to isolate selectivity for physical scene properties. To address this limitation, we introduce a differentiable rendering pipeline that optimizes deformable meshes to obtain MEIs directly in 3D. The method parameterizes mesh deformations with radial basis functions and learns offsets and scales that maximize neuronal responses while enforcing geometric regularity. Applied to models of monkey area V4, our approach enables probing neuronal selectivity to interpretable 3D factors such as pose and lighting. This approach bridges inverse graphics with systems neuroscience, offering a way to probe neural selectivity with physically grounded, 3D stimuli beyond conventional pixel-based methods.
comment: Accepted in Symmetry and Geometry in Neural Representations 2025 (Extended Abstract Track)
CoDS: Enhancing Collaborative Perception in Heterogeneous Scenarios via Domain Separation
Collaborative perception has been proven to improve individual perception in autonomous driving through multi-agent interaction. Nevertheless, most methods often assume identical encoders for all agents, which does not hold true when these models are deployed in real-world applications. To realize collaborative perception in actual heterogeneous scenarios, existing methods usually align neighbor features to those of the ego vehicle, which is vulnerable to noise from domain gaps and thus fails to address feature discrepancies effectively. Moreover, they adopt transformer-based modules for domain adaptation, which causes the model inference inefficiency on mobile devices. To tackle these issues, we propose CoDS, a Collaborative perception method that leverages Domain Separation to address feature discrepancies in heterogeneous scenarios. The CoDS employs two feature alignment modules, i.e., Lightweight Spatial-Channel Resizer (LSCR) and Distribution Alignment via Domain Separation (DADS). Besides, it utilizes the Domain Alignment Mutual Information (DAMI) loss to ensure effective feature alignment. Specifically, the LSCR aligns the neighbor feature across spatial and channel dimensions using a lightweight convolutional layer. Subsequently, the DADS mitigates feature distribution discrepancy with encoder-specific and encoder-agnostic domain separation modules. The former removes domain-dependent information and the latter captures task-related information. During training, the DAMI loss maximizes the mutual information between aligned heterogeneous features to enhance the domain separation process. The CoDS employs a fully convolutional architecture, which ensures high inference efficiency. Extensive experiments demonstrate that the CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade-off between detection accuracy and inference efficiency.
comment: Accepted by IEEE Transactions on Mobile Computing
☆ Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter
In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach achieves 4K+ resolution while maintaining precise content consistency and prompt alignment, two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model's resolution from 1K to 4K+ without requiring structural overhauls: (1) Dual Context Adapter learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency; and (2) Reference Patch Adapter implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and Photo-Concept-Bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence.
☆ Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation
Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available on https://github.com/xingzhejun/Mask-GRPO
☆ Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially the \emph{intrinsic-dynamic} spatial reasoning which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, \textbf{Spatial-DISE}, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: \textbf{I}ntrinsic-\textbf{S}tatic, Intrinsic-\textbf{D}ynamic, \textbf{E}xtrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new \textbf{Spatial-DISE} dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that, current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.
☆ Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment
WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for enabling non-contact and privacy-preserving human-computer interaction in AIoT environments. However, existing methods often suffer from limited generalization and semantic expressiveness due to the domain-sensitive nature of Channel State Information and the lack of high-level gesture abstraction. To address these challenges, we propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which leverages the semantic prior of pre-trained large foundation models to enhance gesture representation learning in both in-domain and cross-domain scenarios. Specifically, we first design a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns via CSI-Ratio phase sequences and Doppler spectrograms. These representations are then fed into a Multiscale Semantic Encoder, which learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention mechanisms. To further enhance category discrimination, we introduce a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations and reduces label ambiguity, especially for semantically similar gestures. Finally, we develop a Robust Dual-Distillation strategy to compress the aligned model into a lightweight student network, jointly distilling intermediate features and semantic-informed soft labels from the teacher model. Extensive experiments on the Widar3.0 benchmark show that GLSDA consistently outperforms state-of-the-art methods in both in-domain and cross-domain gesture recognition tasks, while significantly reducing model size and inference latency. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.
comment: Accepted by IEEE ICPADS 2025
☆ Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering ICCV-2025
Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS), have enabled accurate modeling of dynamic urban scenes, but for urban scenes they require both camera and LiDAR data, ground-truth 3D segmentations and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object agnostic priors in the form of depth and point tracking coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improved further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition, and scene composition.
comment: Accepted at ICCV-2025, project page: https://dynamic-ugsdf.github.io/
☆ DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
☆ Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity
Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, were evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance for instance, MetaCLIP 2's multi-class accuracy drops from 68.8\% to 55.1\% a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.
☆ Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM) -- which has demonstrated strong performance in image recognition and image-text retrieval tasks -- to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.
comment: Accepted to ACM RecSys 2025 (Spotlight)
☆ No-Reference Rendered Video Quality Assessment: Dataset and Metrics
Quality assessment of videos is crucial for many computer graphics applications, including video games, virtual reality, and augmented reality, where visual performance has a significant impact on user experience. When test videos cannot be perfectly aligned with references or when references are unavailable, the significance of no-reference video quality assessment (NR-VQA) methods is undeniable. However, existing NR-VQA datasets and metrics are primarily focused on camera-captured videos; applying them directly to rendered videos would result in biased predictions, as rendered videos are more prone to temporal artifacts. To address this, we present a large rendering-oriented video dataset with subjective quality annotations, as well as a designed NR-VQA metric specific to rendered videos. The proposed dataset includes a wide range of 3D scenes and rendering settings, with quality scores annotated for various display types to better reflect real-world application scenarios. Building on this dataset, we calibrate our NR-VQA metric to assess rendered video quality by looking at both image quality and temporal stability. We compare our metric to existing NR-VQA metrics, demonstrating its superior performance on rendered videos. Finally, we demonstrate that our metric can be used to benchmark supersampling methods and assess frame generation strategies in real-time rendering.
☆ Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models
Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. And the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.
☆ DEF-YOLO: Leveraging YOLO for Concealed Weapon Detection in Thermal Imagin
Concealed weapon detection aims at detecting weapons hidden beneath a person's clothing or luggage. Various imaging modalities like Millimeter Wave, Microwave, Terahertz, Infrared, etc., are exploited for the concealed weapon detection task. These imaging modalities have their own limitations, such as poor resolution in microwave imaging, privacy concerns in millimeter wave imaging, etc. To provide a real-time, 24 x 7 surveillance, low-cost, and privacy-preserved solution, we opted for thermal imaging in spite of the lack of availability of a benchmark dataset. We propose a novel approach and a dataset for concealed weapon detection in thermal imagery. Our YOLO-based architecture, DEF-YOLO, is built with key enhancements in YOLOv8 tailored to the unique challenges of concealed weapon detection in thermal vision. We adopt deformable convolutions at the SPPF layer to exploit multi-scale features; backbone and neck layers to extract low, mid, and high-level features, enabling DEF-YOLO to adaptively focus on localization around the objects in thermal homogeneous regions, without sacrificing much of the speed and throughput. In addition to these simple yet effective key architectural changes, we introduce a new, large-scale Thermal Imaging Concealed Weapon dataset, TICW, featuring a diverse set of concealed weapons and capturing a wide range of scenarios. To the best of our knowledge, this is the first large-scale contributed dataset for this task. We also incorporate focal loss to address the significant class imbalance inherent in the concealed weapon detection task. The efficacy of the proposed work establishes a new benchmark through extensive experimentation for concealed weapon detection in thermal imagery.
☆ Removing Cost Volumes from Optical Flow Estimators ICCV 2025
Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows removing the cost volume from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being $1.2\times$ faster and having a $6\times$ lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at $20\,\mathrm{FPS}$ using only $500\,\mathrm{MB}$ of GPU memory.
comment: ICCV 2025
☆ Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests ICCV 2025
Our daily life is highly influenced by what we consume and see. Attracting and holding one's attention -- the definition of (visual) interestingness -- is essential. The rise of Large Multimodal Models (LMMs) trained on large-scale visual and textual data has demonstrated impressive capabilities. We explore these models' potential to understand to what extent the concepts of visual interestingness are captured and examine the alignment between human assessments and GPT-4o's, a leading LMM, predictions through comparative analysis. Our studies reveal partial alignment between humans and GPT-4o. It already captures the concept as best compared to state-of-the-art methods. Hence, this allows for the effective labeling of image pairs according to their (commonly) interestingness, which are used as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.
comment: ICCV 2025
☆ Self-Augmented Visual Contrastive Decoding
Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adaptively adjusts next token candidate size based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving effective generation of LVLMs.
☆ InstantSfM: Fully Sparse and Parallel Structure-from-Motion
Structure-from-Motion (SfM), a method that recovers camera poses and scene geometry from uncalibrated images, is a central component in robotic reconstruction and simulation. Despite the state-of-the-art performance of traditional SfM methods such as COLMAP and its follow-up work, GLOMAP, naive CPU-specialized implementations of bundle adjustment (BA) or global positioning (GP) introduce significant computational overhead when handling large-scale scenarios, leading to a trade-off between accuracy and speed in SfM. Moreover, the blessing of efficient C++-based implementations in COLMAP and GLOMAP comes with the curse of limited flexibility, as they lack support for various external optimization options. On the other hand, while deep learning based SfM pipelines like VGGSfM and VGGT enable feed-forward 3D reconstruction, they are unable to scale to thousands of input views at once as GPU memory consumption increases sharply as the number of input views grows. In this paper, we unleash the full potential of GPU parallel computation to accelerate each critical stage of the standard SfM pipeline. Building upon recent advances in sparse-aware bundle adjustment optimization, our design extends these techniques to accelerate both BA and GP within a unified global SfM framework. Through extensive experiments on datasets of varying scales (e.g. 5000 images where VGGSfM and VGGT run out of memory), our method demonstrates up to about 40 times speedup over COLMAP while achieving consistently comparable or even improved reconstruction accuracy. Our project page can be found at https://cre185.github.io/InstantSfM/.
☆ Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning NeurIPS 2025
In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation (3D-NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes using only the supervision from labeled (base) 3D classes. The key to this task is to setup the exact correlations between the point representations and their base class labels, as well as the representation correlations between the points from base and novel classes. A coarse or statistical correlation learning may lead to the confusion in novel class inference. lf we impose a causal relationship as a strong correlated constraint upon the learning process, the essential point cloud representations that accurately correspond to the classes should be uncovered. To this end, we introduce a structural causal model (SCM) to re-formalize the 3D-NCD problem and propose a new method, i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we first analyze hidden confounders in the base class representations and the causal relationships between the base and novel classes through SCM. We devise a causal representation prototype that eliminates confounders to capture the causal representations of base classes. A graph structure is then used to model the causal relationships between the base classes' causal representation prototypes and the novel class prototypes, enabling causal reasoning from base to novel classes. Extensive experiments and visualization results on 3D and 2D NCD semantic segmentation demonstrate the superiorities of our method.
comment: Accepted by NeurIPS 2025
☆ Automated document processing system for government agencies using DBNET++ and BART models
An automatic document classification system is presented that detects textual content in images and classifies documents into four predefined categories (Invoice, Report, Letter, and Form). The system supports both offline images (e.g., files on flash drives, HDDs, microSD) and real-time capture via connected cameras, and is designed to mitigate practical challenges such as variable illumination, arbitrary orientation, curved or partially occluded text, low resolution, and distant text. The pipeline comprises four stages: image capture and preprocessing, text detection [1] using a DBNet++ (Differentiable Binarization Network Plus) detector, and text classification [2] using a BART (Bidirectional and Auto-Regressive Transformers) classifier, all integrated within a user interface implemented in Python with PyQt5. The achieved results by the system for text detection in images were good at about 92.88% through 10 hours on Total-Text dataset that involve high resolution images simulate a various and very difficult challenges. The results indicate the proposed approach is effective for practical, mixed-source document categorization in unconstrained imaging scenarios.
comment: 8 pages, 12 figures, article
☆ Universal Image Restoration Pre-training via Masked Degradation Classification
This study introduces a Masked Degradation Classification Pre-Training method (MaskDCPT), designed to facilitate the classification of degradation types in input images, leading to comprehensive image restoration pre-training. Unlike conventional pre-training methods, MaskDCPT uses the degradation type of the image as an extremely weak supervision, while simultaneously leveraging the image reconstruction to enhance performance and robustness. MaskDCPT includes an encoder and two decoders: the encoder extracts features from the masked low-quality input image. The classification decoder uses these features to identify the degradation type, whereas the reconstruction decoder aims to reconstruct a corresponding high-quality image. This design allows the pre-training to benefit from both masked image modeling and contrastive learning, resulting in a generalized representation suited for restoration tasks. Benefit from the straightforward yet potent MaskDCPT, the pre-trained encoder can be used to address universal image restoration and achieve outstanding performance. Implementing MaskDCPT significantly improves performance for both convolution neural networks (CNNs) and Transformers, with a minimum increase in PSNR of 3.77 dB in the 5D all-in-one restoration task and a 34.8% reduction in PIQE compared to baseline in real-world degradation scenarios. It also emergences strong generalization to previously unseen degradation types and levels. In addition, we curate and release the UIR-2.5M dataset, which includes 2.5 million paired restoration samples across 19 degradation types and over 200 degradation levels, incorporating both synthetic and real-world data. The dataset, source code, and models are available at https://github.com/MILab-PKU/MaskDCPT.
☆ MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
☆ End-to-End Multi-Modal Diffusion Mamba ICCV 2025
Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, and Chameleon etc.) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
comment: Accepted by ICCV 2025
☆ Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint on how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io
comment: 23 pages, 28 figures, 8 tables
☆ Real-Time Crowd Counting for Embedded Systems with Lightweight Architecture
Crowd counting is a task of estimating the number of the crowd through images, which is extremely valuable in the fields of intelligent security, urban planning, public safety management, and so on. However, the existing counting methods have some problems in practical application on embedded systems for these fields, such as excessive model parameters, abundant complex calculations, etc. The practical application of embedded systems requires the model to be real-time, which means that the model is fast enough. Considering the aforementioned problems, we design a super real-time model with a stem-encoder-decoder structure for crowd counting tasks, which achieves the fastest inference compared with state-of-the-arts. Firstly, large convolution kernels in the stem network are used to enlarge the receptive field, which effectively extracts detailed head information. Then, in the encoder part, we use conditional channel weighting and multi-branch local fusion block to merge multi-scale features with low computational consumption. This part is crucial to the super real-time performance of the model. Finally, the feature pyramid networks are added to the top of the encoder to alleviate its incomplete fusion problems. Experiments on three benchmarks show that our network is suitable for super real-time crowd counting on embedded systems, ensuring competitive accuracy. At the same time, the proposed network reasoning speed is the fastest. Specifically, the proposed network achieves 381.7 FPS on NVIDIA GTX 1080Ti and 71.9 FPS on NVIDIA Jetson TX1.
☆ CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation NeurIPS 2025
Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff
comment: Accepted by NeurIPS 2025
☆ FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding
The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments heavily relies on the availability of large-scale datasets with accurate annotations. However, collecting and annotating real-world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 introduces several new key contributions: 1) Multimodal data (RGB, depth, semantic labels) across diverse environmental conditions including varying weather and daytime; 2) Depth maps for real samples computed via state-of-the-art monocular depth estimation; 3) Benchmarks for RGB and multimodal semantic segmentation on standard architectures; 4) Studies on synthetic-to-real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV-based 3D urban scene understanding.
comment: 20 pages, 7 figures, 10 tables, data and code available
☆ Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models
Vision-Language-Action (VLA) models have achieved revolutionary progress in robot learning, enabling robots to execute complex physical robot tasks from natural language instructions. Despite this progress, their adversarial robustness remains underexplored. In this work, we propose both adversarial patch attack and corresponding defense strategies for VLA models. We first introduce the Embedding Disruption Patch Attack (EDPA), a model-agnostic adversarial attack that generates patches directly placeable within the camera's view. In comparison to prior methods, EDPA can be readily applied to different VLA models without requiring prior knowledge of the model architecture, or the controlled robotic manipulator. EDPA constructs these patches by (i) disrupting the semantic alignment between visual and textual latent representations, and (ii) maximizing the discrepancy of latent representations between adversarial and corresponding clean visual inputs. Through the optimization of these objectives, EDPA distorts the VLA's interpretation of visual information, causing the model to repeatedly generate incorrect actions and ultimately result in failure to complete the given robotic task. To counter this, we propose an adversarial fine-tuning scheme for the visual encoder, in which the encoder is optimized to produce similar latent representations for both clean and adversarially perturbed visual inputs. Extensive evaluations on the widely recognized LIBERO robotic simulation benchmark demonstrate that EDPA substantially increases the task failure rate of cutting-edge VLA models, while our proposed defense effectively mitigates this degradation. The codebase is accessible via the homepage at https://edpa-attack.github.io/.
☆ EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking
Multimodal semantic cues, such as textual descriptions, have shown strong potential in enhancing target perception for tracking. However, existing methods rely on static textual descriptions from large language models, which lack adaptability to real-time target state changes and prone to hallucinations. To address these challenges, we propose a unified multimodal vision-language tracking framework, named EPIPTrack, which leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Specifically, explicit prompts transform spatial motion information into natural language descriptions to provide spatiotemporal guidance. Implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations capturing appearance attributes. Both prompts undergo dynamic adjustment via the CLIP text encoder to respond to changes in target state. Furthermore, we design a Discriminative Feature Augmentor to enhance visual and cross-modal representations. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios, exhibiting robust adaptability and superior performance.
☆ UniVector: Unified Vector Extraction via Instance-Geometry Interaction
Vector extraction retrieves structured vector geometry from raster images, offering high-fidelity representation and broad applicability. Existing methods, however, are usually tailored to a single vector type (e.g., polygons, polylines, line segments), requiring separate models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connections) independently, limiting the ability to capture complex structures. Inspired by the human brain's simultaneous use of semantic and spatial interactions in visual perception, we propose UniVector, a unified VE framework that leverages instance-geometry interaction to extract multiple vector types within a single model. UniVector encodes vectors as structured queries containing both instance- and geometry-level information, and iteratively updates them through an interaction module for cross-level context exchange. A dynamic shape constraint further refines global structures and key points. To benchmark multi-structure scenarios, we introduce the Multi-Vector dataset with diverse polygons, polylines, and line segments. Experiments show UniVector sets a new state of the art on both single- and multi-structure VE tasks. Code and dataset will be released at https://github.com/yyyyll0ss/UniVector.
☆ What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
comment: 38 pages
☆ Sample-Centric Multi-Task Learning for Detection and Segmentation of Industrial Surface Defects
Industrial surface defect inspection for sample-wise quality control (QC) must simultaneously decide whether a given sample contains defects and localize those defects spatially. In real production lines, extreme foreground-background imbalance, defect sparsity with a long-tailed scale distribution, and low contrast are common. As a result, pixel-centric training and evaluation are easily dominated by large homogeneous regions, making it difficult to drive models to attend to small or low-contrast defects-one of the main bottlenecks for deployment. Empirically, existing models achieve strong pixel-overlap metrics (e.g., mIoU) but exhibit insufficient stability at the sample level, especially for sparse or slender defects. The root cause is a mismatch between the optimization objective and the granularity of QC decisions. To address this, we propose a sample-centric multi-task learning framework and evaluation suite. Built on a shared-encoder architecture, the method jointly learns sample-level defect classification and pixel-level mask localization. Sample-level supervision modulates the feature distribution and, at the gradient level, continually boosts recall for small and low-contrast defects, while the segmentation branch preserves boundary and shape details to enhance per-sample decision stability and reduce misses. For evaluation, we propose decision-linked metrics, Seg_mIoU and Seg_Recall, which remove the bias of classical mIoU caused by empty or true-negative samples and tightly couple localization quality with sample-level decisions. Experiments on two benchmark datasets demonstrate that our approach substantially improves the reliability of sample-level decisions and the completeness of defect localization.
☆ Prompt-based Adaptation in Large-scale Vision Models: A Survey
In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the ``pretrain-then-finetune'' paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity -- pixel-level and token-level. Beyond the core methodologies, we examine PA's integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, we are the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all area to understand and explore the evolving landscape of PA-related research.
☆ MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation
Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation based on part-aware style injection and part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforming existing methods showcasing naturalness and expressive 3D human motion sequences.
☆ Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences
The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.
☆ Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies explore from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose \textbf{CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. \textbf{CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released https://github.com/VitaLemonTea1/CIGOcc
☆ STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control
Edge Gaussian splatting (EGS), which aggregates data from distributed clients and trains a global GS model at the edge server, is an emerging paradigm for scene reconstruction. Unlike traditional edge resource management methods that emphasize communication throughput or general-purpose learning performance, EGS explicitly aims to maximize the GS qualities, rendering existing approaches inapplicable. To address this problem, this paper formulates a novel GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. However, evaluating this function in turn requires clients' images, leading to a causality dilemma. To this end, this paper further proposes a sample-then-transmit EGS (or STT-GS for short) strategy, which first samples a subset of images as pilot data from each client for loss prediction. Based on the first-stage evaluation, communication resources are then prioritized towards more valuable clients. To achieve efficient sampling, a feature-domain clustering (FDC) scheme is proposed to select the most representative data and pilot transmission time minimization (PTTM) is adopted to reduce the pilot overhead.Subsequently, we develop a joint client selection and power control (JCSPC) framework to maximize the GS-oriented function under communication resource constraints. Despite the nonconvexity of the problem, we propose a low-complexity efficient solution based on the penalty alternating majorization minimization (PAMM) algorithm. Experiments unveil that the proposed scheme significantly outperforms existing benchmarks on real-world datasets. It is found that the GS-oriented objective can be accurately predicted with low sampling ratios (e.g.,10%), and our method achieves an excellent tradeoff between view contributions and communication costs.
☆ DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization
Transient Electromagnetic (TEM) method is widely used in various geophysical applications, providing valuable insights into subsurface properties. However, time-domain TEM signals are often submerged in various types of noise. While recent deep learning-based denoising models have shown strong performance, these models are mostly trained on simulated or single real-world scenario data, overlooking the significant differences in noise characteristics from different geographical regions. Intuitively, models trained in one environment often struggle to perform well in new settings due to differences in geological conditions, equipment, and external interference, leading to reduced denoising performance. To this end, we propose the Dictionary-driven Prior Regularization Test-time Adaptation (DP-TTA). Our key insight is that TEM signals possess intrinsic physical characteristics, such as exponential decay and smoothness, which remain consistent across different regions regardless of external conditions. These intrinsic characteristics serve as ideal prior knowledge for guiding the TTA strategy, which helps the pre-trained model dynamically adjust parameters by utilizing self-supervised losses, improving denoising performance in new scenarios. To implement this, we customized a network, named DTEMDNet. Specifically, we first use dictionary learning to encode these intrinsic characteristics as a dictionary-driven prior, which is integrated into the model during training. At the testing stage, this prior guides the model to adapt dynamically to new environments by minimizing self-supervised losses derived from the dictionary-driven consistency and the signal one-order variation. Extensive experimental results demonstrate that the proposed method achieves much better performance than existing TEM denoising methods and TTA methods.
☆ Foveation Improves Payload Capacity in Steganography SIGGRAPH
Steganography finds its use in visual medium such as providing metadata and watermarking. With support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography.
comment: SIGGRAPH Asia 2025 Posters Proceedings
☆ Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN
This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. Though 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNNLSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical implementation.This project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.
☆ OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment
Text-image alignment constitutes a foundational challenge in multimedia content understanding, where effective modeling of cross-modal semantic correspondences critically enhances retrieval system performance through joint embedding space optimization. Given the inherent difference in information entropy between texts and images, conventional approaches often show an imbalance in the mutual retrieval of these two modalities. To address this particular challenge, we propose to use the open semantic knowledge of Large Language Model (LLM) to fill for the entropy gap and reproduce the alignment ability of humans in these tasks. Our entropy-enhancing alignment is achieved through a two-step process: 1) a new prompt template that does not rely on explicit knowledge in the task domain is designed to use LLM to enhance the polysemy description of the text modality. By analogy, the information entropy of the text modality relative to the visual modality is increased; 2) A hypergraph adapter is used to construct multilateral connections between the text and image modalities, which can correct the positive and negative matching errors for synonymous semantics in the same fixed embedding space, whilst reducing the noise caused by open semantic entropy by mapping the reduced dimensions back to the original dimensions. Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter), showcasing 16.8\% (text-to-image) and 40.1\% (image-to-text) cross-modal retrieval gains over existing methods while establishing new state-of-the-art performance in semantic alignment tasks.
☆ VPREG: An Optimal Control Formulation for Diffeomorphic Image Registration Based on the Variational Principle Grid Generation Method
This paper introduces VPreg, a novel diffeomorphic image registration method. This work provides several improvements to our past work on mesh generation and diffeomorphic image registration. VPreg aims to achieve excellent registration accuracy while controlling the quality of the registration transformations. It ensures a positive Jacobian determinant of the spatial transformation and provides an accurate approximation of the inverse of the registration, a crucial property for many neuroimaging workflows. Unlike conventional methods, VPreg generates this inverse transformation within the group of diffeomorphisms rather than operating on the image space. The core of VPreg is a grid generation approach, referred to as \emph{Variational Principle} (VP), which constructs non-folding grids with prescribed Jacobian determinant and curl. These VP-generated grids guarantee diffeomorphic spatial transformations essential for computational anatomy and morphometry, and provide a more accurate inverse than existing methods. To assess the potential of the proposed approach, we conduct a performance analysis for 150 registrations of brain scans from the OASIS-1 dataset. Performance evaluation based on Dice scores for 35 regions of interest, along with an empirical analysis of the properties of the computed spatial transformations, demonstrates that VPreg outperforms state-of-the-art methods in terms of Dice scores, regularity properties of the computed transformation, and accuracy and consistency of the provided inverse map. We compare our results to ANTs-SyN, Freesurfer-Easyreg, and FSL-Fnirt.
comment: 30 pages, 9 figures
☆ DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models
Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment and annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation to evaluating autonomous driving systems.
comment: 9 pages, 3 figures
☆ EgoSocial: Benchmarking Proactive Intervention Ability of Omnimodal LLMs via Egocentric Social Interaction Perception
As AR/VR technologies become integral to daily life, there's a growing need for AI that understands human social dynamics from an egocentric perspective. However, current LLMs often lack the social awareness to discern when to intervene as AI assistant. This leads to constant, socially unaware responses that may disrupt natural conversation and negatively impact user focus. To address these limitations, we introduce EgoSocial, a large-scale egocentric dataset with 13,500 social video-question pairs, specifically designed to benchmark intervention in social interaction perception. We also present an in-depth analysis of current omnimodal LLMs (OLLMs) to assess their effectiveness in detecting diverse social contextual cues. Experiments show that OLLMs still struggle to detect the intervention timing (14.4% for Gemini 2.5 Pro). We also propose EgoSoD (EgoSocial Detection), an end-to-end method for robustly discerning social dynamics. Informed by our OLLM analysis, EgoSoD integrates multimodal contextual cues (e.g., audio and visual cues) into a social thinking graph, dynamically modeling participants and interactions. Our method proactively detects intervention timing and social interactions, precisely determining when to intervene. Our EgoSoD improves Phi-4 by 45.6% and Gemini 2.5 Pro by 9.9% on Intervention Timing performance, and improves Phi-4 by 20.4% and Gemini 2.5 Pro by 6.9% on overall Social Interaction performance. We will release the dataset and code soon.
☆ Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation
Text-to-image (T2I) diffusion models have recently demonstrated significant progress in video editing. However, existing video editing methods are severely limited by their high computational overhead and memory consumption. Furthermore, these approaches often sacrifice visual fidelity, leading to undesirable temporal inconsistencies and artifacts such as blurring and pronounced mosaic-like patterns. We propose Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. Edit-Your-Interest introduces a spatio-temporal feature memory to cache features from previous frames, significantly reducing computational overhead compared to full-sequence spatio-temporal modeling approaches. Specifically, we first introduce a Spatio-Temporal Feature Memory bank (SFM), which is designed to efficiently cache and retain the crucial image tokens processed by spatial attention. Second, we propose the Feature Most-Similar Propagation (FMP) method. FMP propagates the most relevant tokens from previous frames to subsequent ones, preserving temporal consistency. Finally, we introduce an SFM update algorithm that continuously refreshes the cached features, ensuring their long-term relevance and effectiveness throughout the video sequence. Furthermore, we leverage cross-attention maps to automatically extract masks for the instances of interest. These masks are seamlessly integrated into the diffusion denoising process, enabling fine-grained control over target objects and allowing Edit-Your-Interest to perform highly accurate edits while robustly preserving the background integrity. Extensive experiments decisively demonstrate that the proposed Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity, validating its superior effectiveness and practicality.
comment: 32 pages, 11 figures
☆ Counting Hallucinations in Diffusion Models
Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently. This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.
☆ Unsupervised Domain Adaptation via Content Alignment for Hippocampus Segmentation
Deep learning models for medical image segmentation often struggle when deployed across different datasets due to domain shifts - variations in both image appearance, known as style, and population-dependent anatomical characteristics, referred to as content. This paper presents a novel unsupervised domain adaptation framework that directly addresses domain shifts encountered in cross-domain hippocampus segmentation from MRI, with specific emphasis on content variations. Our approach combines efficient style harmonisation through z-normalisation with a bidirectional deformable image registration (DIR) strategy. The DIR network is jointly trained with segmentation and discriminator networks to guide the registration with respect to a region of interest and generate anatomically plausible transformations that align source images to the target domain. We validate our approach through comprehensive evaluations on both a synthetic dataset using Morpho-MNIST (for controlled validation of core principles) and three MRI hippocampus datasets representing populations with varying degrees of atrophy. Across all experiments, our method outperforms existing baselines. For hippocampus segmentation, when transferring from young, healthy populations to clinical dementia patients, our framework achieves up to 15% relative improvement in Dice score compared to standard augmentation methods, with the largest gains observed in scenarios with substantial content shift. These results highlight the efficacy of our approach for accurate hippocampus segmentation across diverse populations.
☆ Direction-aware multi-scale gradient loss for infrared and visible image fusion
Infrared and visible image fusion aims to integrate complementary information from co-registered source images to produce a single, informative result. Most learning-based approaches train with a combination of structural similarity loss, intensity reconstruction loss, and a gradient-magnitude term. However, collapsing gradients to their magnitude removes directional information, yielding ambiguous supervision and suboptimal edge fidelity. We introduce a direction-aware, multi-scale gradient loss that supervises horizontal and vertical components separately and preserves their sign across scales. This axis-wise, sign-preserving objective provides clear directional guidance at both fine and coarse resolutions, promoting sharper, better-aligned edges and richer texture preservation without changing model architectures or training protocols. Experiments on open-source model and multiple public benchmarks demonstrate effectiveness of our approach.
☆ True Self-Supervised Novel View Synthesis is Transferable
In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry -- such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.
☆ Towards Reversible Model Merging For Low-rank Weights
Model merging aims to combine multiple fine-tuned models into a single set of weights that performs well across all source tasks. While prior work has shown that merging can approximate the performance of individual fine-tuned models for each task, it largely overlooks scenarios where models are compressed into low-rank representations, either through low-rank adaptation (LoRA) or post-training singular value decomposition (SVD). We first demonstrate that applying conventional merging methods to low-rank weights leads to severe performance degradation in the merged model. Motivated by this phenomenon, we propose a fundamentally different approach: instead of collapsing all adapters into one set of weights, we construct a compact basis (e.g., an equivalent of holding two or more models) from which original task-specific models can be recovered via linear combination. This reframes merging as generating a reconstruction-capable model space rather than producing a single merged model. Crucially, this allows us to ``revert'' to each individual model when needed, recognizing that no merged model can consistently outperform one specialized for its task. Building on this insight, we introduce our method, Reversible Model Merging (RMM), an efficient, data-free, and flexible method that provides a closed-form solution for selecting the optimal basis of model weights and task-specific coefficients for linear combination. Extensive experiments across diverse datasets and model scales demonstrate that RMM consistently outperforms existing merging approaches, preserving the performance of low-rank compressed models by a significant margin.
☆ PoissonNet: A Local-Global Approach for Learning on Surfaces SIGGRAPH
Many network architectures exist for learning on meshes, yet their constructions entail delicate trade-offs between difficulty learning high-frequency features, insufficient receptive field, sensitivity to discretization, and inefficient computational overhead. Drawing from classic local-global approaches in mesh processing, we introduce PoissonNet, a novel neural architecture that overcomes all of these deficiencies by formulating a local-global learning scheme, which uses Poisson's equation as the primary mechanism for feature propagation. Our core network block is simple; we apply learned local feature transformations in the gradient domain of the mesh, then solve a Poisson system to propagate scalar feature updates across the surface globally. Our local-global learning framework preserves the features's full frequency spectrum and provides a truly global receptive field, while remaining agnostic to mesh triangulation. Our construction is efficient, requiring far less compute overhead than comparable methods, which enables scalability -- both in the size of our datasets, and the size of individual training samples. These qualities are validated on various experiments where, compared to previous intrinsic architectures, we attain state-of-the-art performance on semantic segmentation and parameterizing highly-detailed animated surfaces. Finally, as a central application of PoissonNet, we show its ability to learn deformations, significantly outperforming state-of-the-art architectures that learn on surfaces.
comment: In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 2025, 16 pages
☆ cubic: CUDA-accelerated 3D Bioimage Computing ICCV 2025
Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programmable interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic's API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github$.$com/alxndrkalinin/cubic
comment: accepted to BioImage Computing workshop @ ICCV 2025
☆ Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
☆ Synchronization of Multiple Videos ICCV 2025
Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/
comment: ICCV 2025
☆ Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding NeurIPS 2025
Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of $3.0\%\sim 5.4\%$ over base models on MLVU, and outperformed state-of-the-art video RAG methods by $8.6\%$. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.
comment: NeurIPS 2025 (Spotlight). Webpage at https://xiaoqian-shen.github.io/Vgent
☆ NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.
☆ Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer
Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model's performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen's kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen's kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen's kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen's kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.
Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models
Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
comment: 11 pages, 7 figures, 8 tables. To be published in Applied AI Letters
☆ Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems ICLR 2025
Recovering true signals from noisy measurements is a central challenge in inverse problems spanning medical imaging, geophysics, and signal processing. Current solutions balance prior assumptions regarding the true signal (regularization) with agreement to noisy measured data (data-fidelity). Conventional data-fidelity loss functions, such as mean-squared error (MSE) or negative log-likelihood, seek pointwise agreement with noisy measurements, often leading to overfitting to noise. In this work, we instead evaluate data-fidelity collectively by testing whether the observed measurements are statistically consistent with the noise distributions implied by the current estimate. We adopt this aggregated perspective and introduce distributional consistency (DC) loss, a data-fidelity objective that replaces pointwise matching with distribution-level calibration using model-based probability scores for each measurement. DC loss acts as a direct and practical plug-in replacement for standard data consistency terms: i) it is compatible with modern regularizers, ii) it is optimized in the same way as traditional losses, and iii) it avoids overfitting to measurement noise even without the use of priors. Its scope naturally fits many practical inverse problems where the measurement-noise distribution is known and where the measured dataset consists of many independent noisy values. We demonstrate efficacy in two key example application areas: i) in image denoising with deep image prior, using DC instead of MSE loss removes the need for early stopping and achieves higher PSNR; ii) in medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances the efficacy of hand-crafted regularization. These results position DC loss as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for inverse problems.
comment: Preprint; submitted to ICLR 2025 for possible publication
☆ Weight Weaving: Parameter Pooling for Data-Free Model Merging NeurIPS 2025
Model merging provides a cost-effective and data-efficient combination of specialized deep neural networks through parameter integration. This technique leverages expert models across downstream tasks without requiring retraining. Most model merging approaches critically depend on scaling hyper-parameters $\lambda$, which weight each model's contribution globally or individually. Principled approaches for setting scaling factors without accessing any data (data-free) are scarce, often leading researchers to tune $\lambda$ using privileged data from the evaluation set, which is obviously unfeasible in practice. To address this limitation, we introduce Weight Weaving, a plug-and-play technique that pools model weights across $\lambda$ values search space using user-defined pooling functions, such as averaging, random selection, or even existing model merging methods. Our method demonstrates high modularity, imposing minimal constraints on the search space. It operates orthogonally to existing model merging methods and eliminates evaluation data requirements. We validate Weight Weaving across three ViT variants in three experimental setups: vision multi-task learning, vision continual learning, and domain generalization. Our method consistently improves the performance of several model merging methods, achieving average accuracy gains of up to 15.9 percentage points in a data-free setting.
comment: 17 pages, 3 figures. Accepted at the 3rd UniReps Workshop @ NeurIPS 2025
♻ ☆ SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.
comment: Technical Report
♻ ☆ EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels
Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at https://github.com/KPeng9510/ERELIFM.
comment: The source code is available at https://github.com/KPeng9510/ERELIFM
♻ ☆ MCOP: Multi-UAV Collaborative Occupancy Prediction
Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird's Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experiments results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.
♻ ☆ Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scene, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination. The extensive experiments on 17 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
♻ ☆ Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue Boundaries
Real-time acquisition of accurate scene depth is essential for automated robotic minimally invasive surgery. Stereo matching with binocular endoscopy can provide this depth information. However, existing stereo matching methods, designed primarily for natural images, often struggle with endoscopic images due to fuzzy tissue boundaries and typically fail to meet real-time requirements for high-resolution endoscopic image inputs. To address these challenges, we propose \textbf{RRESM}, a real-time stereo matching method tailored for endoscopic images. Our approach integrates a 3D Mamba Coordinate Attention module that enhances cost aggregation through position-sensitive attention maps and long-range spatial dependency modeling via the Mamba block, generating a robust cost volume without substantial computational overhead. Additionally, we introduce a High-Frequency Disparity Optimization module that refines disparity predictions near tissue boundaries by amplifying high-frequency details in the wavelet domain. Evaluations on the SCARED and SERV-CT datasets demonstrate state-of-the-art matching accuracy with a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/RRESM.
♻ ☆ MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
♻ ☆ VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
♻ ☆ Software System for Low-Cost, GUI-Based Microscopy Segmentation: Algorithmic Implementation
This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical application-particularly in cell transplantation for personalised medicine and muscle regeneration therapies. The access to the application is available for reproducibility.
♻ ☆ $Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization NeurIPS2025
Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named {\Delta}Energy. {\Delta}Energy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, {\Delta}Energy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for {\Delta}Energy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yields a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
comment: Accepted by NeurIPS2025
♻ ☆ ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning NeurIPS 2025
Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) Channel-Aware Decoder, a lightweight decoder utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI. Our code is publicly available at https://github.com/chaudatascience/cha_mae_vit.
comment: Accepted to NeurIPS 2025
♻ ☆ Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition
This paper presents a comprehensive overview of the Ultralytics YOLO(You Only Look Once) family of object detectors, focusing the architectural evolution, benchmarking, deployment perspectives, and future challenges. The review begins with the most recent release, YOLO26 (or YOLOv26), which introduces key innovations including Distribution Focal Loss (DFL) removal, native NMS-free inference, Progressive Loss Balancing (ProgLoss), Small-Target-Aware Label Assignment (STAL), and the MuSGD optimizer for stable training. The progression is then traced through YOLO11, with its hybrid task assignment and efficiency-focused modules; YOLOv8, which advanced with a decoupled detection head and anchor-free predictions; and YOLOv5, which established the modular PyTorch foundation that enabled modern YOLO development. Benchmarking on the MS COCO dataset provides a detailed quantitative comparison of YOLOv5, YOLOv8, YOLO11, and YOLO26 (YOLOv26), alongside cross-comparisons with YOLOv12, YOLOv13, RT-DETR, and DEIM(DETR with Improved Matching). Metrics including precision, recall, F1 score, mean Average Precision, and inference speed are analyzed to highlight trade-offs between accuracy and efficiency. Deployment and application perspectives are further discussed, covering export formats, quantization strategies, and real-world use in robotics, agriculture, surveillance, and manufacturing. Finally, the paper identifies challenges and future directions, including dense-scene limitations, hybrid CNN-Transformer integration, open-vocabulary detection, and edge-aware training approaches. (Object Detection, YOLOv26, YOLO)
♻ ☆ QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots
Panoramic cameras, capturing comprehensive 360-degree environmental data, are suitable for quadruped robots in surrounding perception and interaction with complex environments. However, the scarcity of high-quality panoramic training data-caused by inherent kinematic constraints and complex sensor calibration challenges-fundamentally limits the development of robust perception systems tailored to these embodied platforms. To address this issue, we propose QuaDreamer-the first panoramic data generation engine specifically designed for quadruped robots. QuaDreamer focuses on mimicking the motion paradigm of quadruped robots to generate highly controllable, realistic panoramic videos, providing a data source for downstream tasks. Specifically, to effectively capture the unique vertical vibration characteristics exhibited during quadruped locomotion, we introduce Vertical Jitter Encoding (VJE). VJE extracts controllable vertical signals through frequency-domain feature filtering and provides high-quality prompts. To facilitate high-quality panoramic video generation under jitter signal control, we propose a Scene-Object Controller (SOC) that effectively manages object motion and boosts background jitter control through the attention mechanism. To address panoramic distortions in wide-FoV video generation, we propose the Panoramic Enhancer (PE)-a dual-stream architecture that synergizes frequency-texture refinement for local detail enhancement with spatial-structure correction for global geometric consistency. We further demonstrate that the generated video sequences can serve as training data for the quadruped robot's panoramic visual perception model, enhancing the performance of multi-object tracking in 360-degree scenes. The source code and model weights will be publicly available at https://github.com/losehu/QuaDreamer.
comment: Accepted to CoRL 2025. The source code and model weights will be publicly available at https://github.com/losehu/QuaDreamer
♻ ☆ Towards Methane Detection Onboard Satellites
Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using \textit{unorthorectified} data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS , along with code at https://github.com/spaceml-org/plume-hunter.
♻ ☆ Health Care Waste Classification Using Deep Learning Aligned with Nepal's Bin Color Guidelines
The increasing number of Health Care facilities in Nepal has added up the challenges on managing health care waste (HCW). Improper segregation and disposal of HCW leads to contamination, spreading of infectious diseases and risk for waste handlers. This study benchmarks the state of the art waste classification models: ResNeXt-50, EfficientNet-B0, MobileNetV3-S, YOLOv8-n and YOLOv5-s using stratified 5-fold cross-validation technique on combined HCW data. YOLOv5-s achieved the highest accuracy (95.06%) but fell short with the YOLOv8-n model in inference speed with few milliseconds. The EfficientNet-B0 showed promising results of 93.22% accuracy but took the highest inference time. Following a repetitive ANOVA test to confirm the statistical significance, the best performing model (YOLOv5-s) was deployed to the web with bin color mapped using Nepal's HCW management standards. Further work is suggested to address data limitation and ensure localized context.
comment: 7 pages, 5 figures
♻ ☆ Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning
The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. In addition, we further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
comment: 34 pages, 25 figures
♻ ☆ RelTopo: Multi-Level Relational Modeling for Driving Scene Topology Reasoning
Accurate road topology reasoning is critical for autonomous driving, enabling effective navigation and adherence to traffic regulations. Central to this task are lane perception and topology reasoning. However, existing methods typically focus on either lane detection or Lane-to-Lane (L2L) topology reasoning, often \textit{neglecting} Lane-to-Traffic-element (L2T) relationships or \textit{failing} to optimize these tasks jointly. Furthermore, most approaches either overlook relational modeling or apply it in a limited scope, despite the inherent spatial relationships among road elements. We argue that relational modeling is beneficial for both perception and reasoning, as humans naturally leverage contextual relationships for road element recognition and their connectivity inference. To this end, we introduce relational modeling into both perception and reasoning, \textit{jointly} enhancing structural understanding. Specifically, we propose: 1) a relation-aware lane detector, where our geometry-biased self-attention and \curve\ cross-attention refine lane representations by capturing relational dependencies; 2) relation-enhanced topology heads, including a geometry-enhanced L2L head and a cross-view L2T head, boosting reasoning with relational cues; and 3) a contrastive learning strategy with InfoNCE loss to regularize relationship embeddings. Extensive experiments on OpenLane-V2 demonstrate that our approach significantly improves both detection and topology reasoning metrics, achieving +3.1 in DET$_l$, +5.3 in TOP$_{ll}$, +4.9 in TOP$_{lt}$, and an overall +4.4 in OLS, setting a new state-of-the-art. Code will be released.
comment: Preprint. Under review
♻ ☆ DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability
Despite deep learning's broad success, its abstract-reasoning bottleneck persists. We tackle Raven's Progressive Matrices (RPM), the benchmark for pattern, reasoning and problem-solving intelligence. We model the full causal chain image $\rightarrow$ attributes $\rightarrow$ progressive patterns $\rightarrow$ consistency $\rightarrow$ answer and build the baseline DIO. Yet DIO's mutual-information lower-bound objective does not embed human logic: the bound is loose and statistic-based, ignoring causal subject-object links. We therefore present three refinements. 1) Brando introduces trainable negative options to tighten the variational bound. 2) WORLD replaces generation with a Gaussian-mixture feature model that supplies infinite, weighted negatives, further tightening the bound. 3) DIEGO adds metadata supervision to rectify the "attributes $\rightarrow$ patterns" semantic gap, aligning representations with human rules. These upgrades substantially boost discriminative RPM accuracy and, for the first time, let DIO generate valid answers in open-ended RPM. The work provides causal-driven design guidelines, objective-refinement strategies and cross-modal insights for abstract-reasoning research.
comment: 15 pages, 9 figures, 8 tables
♻ ☆ MULTI: Multimodal Understanding Leaderboard with Text and Images
The rapid development of multimodal large language models (MLLMs) raises the question of how they compare to human performance. While existing datasets often feature synthetic or overly simplistic tasks, some models have already surpassed human expert baselines. In this paper, we present MULTI, a Chinese multimodal dataset derived from authentic examination questions. Comprising over 18,000 carefully selected and refined questions, MULTI evaluates models using real-world examination standards, encompassing image-text comprehension, complex reasoning, and knowledge recall. Additionally, We also introduce MULTI-Elite, a 500-question selected hard subset, and MULTI-Extend with more than 4,500 external knowledge context pieces for testing in-context learning capabilities. Our evaluation highlights substantial room for MLLM advancement, with Qwen2-VL-72B achieving a 76.9% accuracy on MULTI and 53.1% on MULTI-Elite leading 25 evaluated models, compared to human expert baselines of 86.1% and 73.1%. MULTI serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.
comment: 24 pages, 19 figures, 10 tables. Details and access are available at: https://OpenDFM.github.io/MULTI-Benchmark/
♻ ☆ MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
♻ ☆ MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering ACM MM 2025
Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at https://megc2025.github.io.
comment: Micro-Expression Grand Challenge (MEGC) at ACM MM 2025
♻ ☆ FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition
Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.
♻ ☆ PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a \textit{Global Semantic Drift}, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise \textbf{PRISM}, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30\% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7\% relative improvement over the baseline. The code is available for access via \href{https://github.com/bibisbar/PRISM}{this repository}.
♻ ☆ SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology CVPR 2025
With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce corresponding SSL4Eco, a multi-date Sentinel-2 dataset, on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state of the art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.
comment: CVPR 2025, EarthVision workshop
♻ ☆ TBStar-Edit: From Image Editing Pattern Shifting to Consistency Enhancement
Recent advances in image generation and editing technologies have enabled state-of-the-art models to achieve impressive results in general domains. However, when applied to e-commerce scenarios, these general models often encounter consistency limitations. To address this challenge, we introduce TBStar-Edit, an new image editing model tailored for the e-commerce domain. Through rigorous data engineering, model architecture design and training strategy, TBStar-Edit achieves precise and high-fidelity image editing while maintaining the integrity of product appearance and layout. Specifically, for data engineering, we establish a comprehensive data construction pipeline, encompassing data collection, construction, filtering, and augmentation, to acquire high-quality, instruction-following, and strongly consistent editing data to support model training. For model architecture design, we design a hierarchical model framework consisting of a base model, pattern shifting modules, and consistency enhancement modules. For model training, we adopt a two-stage training strategy to enhance the consistency preservation: first stage for editing pattern shifting, and second stage for consistency enhancement. Each stage involves training different modules with separate datasets. Finally, we conduct extensive evaluations of TBStar-Edit on a self-proposed e-commerce benchmark, and the results demonstrate that TBStar-Edit outperforms existing general-domain editing models in both objective metrics (VIE Score) and subjective user preference.
♻ ☆ Systematic Literature Review on Vehicular Collaborative Perception - A Computer Vision Perspective
The effectiveness of autonomous vehicles relies on reliable perception capabilities. Despite significant advancements in artificial intelligence and sensor fusion technologies, current single-vehicle perception systems continue to encounter limitations, notably visual occlusions and limited long-range detection capabilities. Collaborative Perception (CP), enabled by Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication, has emerged as a promising solution to mitigate these issues and enhance the reliability of autonomous systems. Beyond advancements in communication, the computer vision community is increasingly focusing on improving vehicular perception through collaborative approaches. However, a systematic literature review that thoroughly examines existing work and reduces subjective bias is still lacking. Such a systematic approach helps identify research gaps, recognize common trends across studies, and inform future research directions. In response, this study follows the PRISMA 2020 guidelines and includes 106 peer-reviewed articles. These publications are analyzed based on modalities, collaboration schemes, and key perception tasks. Through a comparative analysis, this review illustrates how different methods address practical issues such as pose errors, temporal latency, communication constraints, domain shifts, heterogeneity, and adversarial attacks. Furthermore, it critically examines evaluation methodologies, highlighting a misalignment between current metrics and CP's fundamental objectives. By delving into all relevant topics in-depth, this review offers valuable insights into challenges, opportunities, and risks, serving as a reference for advancing research in vehicular collaborative perception.
comment: 38 pages, 8 figures
♻ ☆ Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We adopt a task-oriented perspective to systematically review the applications and advancements of multimodal fusion methods and VLMs in the field of robot vision. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. Meanwhile, we also analyze the architectural characteristics and practical implementations of these fusion strategies in key tasks such as simultaneous localization and mapping (SLAM), 3D object detection, navigation, and manipulation. We compare the evolutionary paths and applicability of VLMs based on large language models (LLMs) with traditional multimodal fusion methods.Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Building on this analysis, we identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation. We propose future directions such as self-supervised learning for robust multimodal representations, structured spatial memory and environment modeling to enhance spatial intelligence, and the integration of adversarial robustness and human feedback mechanisms to enable ethically aligned system deployment. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.
comment: 27 pages, 11 figures. Accepted to Information Fusion. Final journal version: volume 126 (Part B), February 2026
♻ ☆ MedDINOv3: How to adapt vision foundation models for medical image segmentation?
Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperform specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
♻ ☆ SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models
In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as "Clear and Day" weather, leading to decreased performance in under-represented conditions like "Rainy and Night". To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet-a DM that guides data generation conditioned on semantic maps-along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model. We release our code and pipeline at https://github.com/UTAustin-SwarmLab/SynDiff-AD.
comment: 15 pages, 10 figures
♻ ☆ Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation
Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from initial Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), and Multi-Level Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a perspective-aware teaching (PAT) KD framework to enable feature distillation across diverse architectures. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model's learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architectures. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method. Our code is available at https://github.com/jimmylin0979/PAT.git.
♻ ☆ ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom
Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., limited multi-modal reasoning capacities, and insufficient and irrelevant visual descriptions). We then decompose visual reasoning process into two stages: proactive visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features decoupled vision-reasoning capabilities and multi-run proactive perception. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks for both open-source and closed-source models, with the average performance gain reaching 13.2%. Besides, the integration of LLMs allows ProReason to produce high-quality visual reasoning data, which empowers ProReason-distilled models (i.e., ProReason-VL and ProReason-Q3) to achieve superior performance in downstream tasks. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.
♻ ☆ A Simple Framework for Open-Vocabulary Zero-Shot Segmentation ICLR 2025
Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pairs datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
comment: ICLR 2025
♻ ☆ Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) on multimodal large language models achieves strong performance with massive labeled datasets, they struggle in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 enforces the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then leverage these rationales to localize target objects. This "reason first, then act" process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at: https://github.com/Geo-R1/geo-r1.
♻ ☆ End-to-End Semantic Preservation in Text-Aware Image Compression Systems
Traditional image compression methods aim to reconstruct images for human perception, prioritizing visual fidelity over task relevance. In contrast, Coding for Machines focuses on preserving information essential for automated understanding. Building on this principle, we present an end-to-end compression framework that retains text-specific features for Optical Character Recognition (OCR). The encoder operates at roughly half the computational cost of the OCR module, making it suitable for resource-limited devices. When on-device OCR is infeasible, images can be efficiently compressed and later decoded to recover textual content. Experiments show significant improvements in text extraction accuracy at low bitrates, even outperforming OCR on uncompressed images. We further extend this study to general-purpose encoders, exploring their capacity to preserve hidden semantics under extreme compression. Instead of optimizing for visual fidelity, we examine whether compact, visually degraded representations can retain recoverable meaning through learned enhancement and recognition modules. Results demonstrate that semantic information can persist despite severe compression, bridging text-oriented compression and general-purpose semantic preservation in machine-centered image coding.
♻ ☆ MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query NeurIPS 2025
Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify existing models's limitation: focusing solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.
comment: NeurIPS 2025; Project Page, Code, and Dataset at: https://merit-2025.github.io/
♻ ☆ VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field. However, in the context of Underwater Image Enhancement (UIE), the standard sequential scanning mechanism is fundamentally challenged by the unique statistical distribution characteristics of underwater scenes. The predominance of large-portion, homogeneous but useless oceanic backgrounds can dilute the feature representation responses of sparse yet valuable targets, thereby impeding effective state propagation and compromising the model's ability to preserve both local semantics and global structure. To address this limitation, we propose a novel Value-Driven Reordering Scanning framework for UIE, termed VRS-UIE. Its core innovation is a Multi-Granularity Value Guidance Learning (MVGL) module that generates a pixel-aligned value map to dynamically reorder the SSM's scanning sequence. This prioritizes informative regions to facilitate the long-range state propagation of salient features. Building upon the MVGL, we design a Mamba-Conv Mixer (MCM) block that synergistically integrates priority-driven global sequencing with dynamically adjusted local convolutions, thereby effectively modeling both large-portion oceanic backgrounds and high-value semantic targets. A Cross-Feature Bridge (CFB) further refines multi-level feature fusion. Extensive experiments demonstrate that our VRS-UIE framework sets a new state-of-the-art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity. Furthermore, by incorporating efficient convolutional operators and resolution rescaling, we construct a light-weight yet effective scheme, VRS-UIE-S, suitable for real-time UIE applications.
♻ ☆ FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
comment: EMNLP 2025 Main. Project page: https://ahnjaewoo.github.io/flashadventure
♻ ☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
♻ ☆ PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis
Recent talking head synthesis works typically adopt speech features extracted from large-scale pre-trained acoustic models. However, the intrinsic many-to-many relationship between speech and lip motion causes phoneme-viseme alignment ambiguity, leading to inaccurate and unstable lips. To further improve lip sync accuracy, we propose PASE (Phoneme-Aware Speech Encoder), a novel speech representation model that bridges the gap between phonemes and visemes. PASE explicitly introduces phoneme embeddings as alignment anchors and employs a contrastive alignment module to enhance the discriminability between corresponding audio-visual pairs. In addition, a prediction and reconstruction task is designed to improve robustness under noise and partial modality absence. Experimental results show PASE significantly improves lip sync accuracy and achieves state-of-the-art performance across both NeRF- and 3DGS-based rendering frameworks, outperforming conventional methods based on acoustic features by 13.7 % and 14.2 %, respectively. Importantly, PASE can be seamlessly integrated into diverse talking head pipelines to improve the lip sync accuracy without architectural modifications.
♻ ☆ Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to accurately represent driving-specific scenarios, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data, ensuring faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods in all key metrics.
♻ ☆ SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models
Text-to-image models have shown remarkable capabilities in generating high-quality images from natural language descriptions. However, these models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. Despite various defensive strategies, achieving robustness against attacks while maintaining practical utility in real-world applications remains a significant challenge. To address this issue, we first conduct an empirical study of the text encoder in the Stable Diffusion (SD) model, which is a widely used and representative text-to-image model. Our findings reveal that the [EOS] token acts as a semantic aggregator, exhibiting distinct distributional patterns between benign and adversarial prompts in its embedding space. Building on this insight, we introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider combines an embedding-level recognition model with a safety-aware feature erasure beam search algorithm. This integration enables the framework to maintain high-quality image generation for benign prompts while ensuring robust defense against both in-domain and out-of-domain attacks. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48\% across various attack scenarios. Moreover, instead of refusing to generate or producing black images for unsafe prompts, SafeGuider generates safe and meaningful images, enhancing its practical utility. In addition, SafeGuider is not limited to the SD model and can be effectively applied to other text-to-image models, such as the Flux model, demonstrating its versatility and adaptability across different architectures. We hope that SafeGuider can shed some light on the practical deployment of secure text-to-image systems.
comment: Accepted by ACM CCS 2025, Code is available at [this https URL](https://github.com/pgqihere/safeguider)
♻ ☆ On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find (1) actions as the most fragile modality, (2) Existing visual-robust VLA do not gain robustness in other modality, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on real-world FR5 robot with limited demonstrations, showing absolute gains by 65.6% under perturbations of four modalities.
♻ ☆ Endoscopic Depth Estimation Based on Deep Learning: A Survey
Endoscopic depth estimation is a critical technology for improving the safety and precision of minimally invasive surgery. It has attracted considerable attention from researchers in medical imaging, computer vision, and robotics. Over the past decade, a large number of methods have been developed. Despite the existence of several related surveys, a comprehensive overview focusing on recent deep learning-based techniques is still limited. This paper endeavors to bridge this gap by systematically reviewing the state-of-the-art literature. Specifically, we provide a thorough survey of the field from three key perspectives: data, methods, and applications. Firstly, at the data level, we describe the acquisition process of publicly available datasets. Secondly, at the methodological level, we introduce both monocular and stereo deep learning-based approaches for endoscopic depth estimation. Thirdly, at the application level, we identify the specific challenges and corresponding solutions for the clinical implementation of depth estimation technology, situated within concrete clinical scenarios. Finally, we outline potential directions for future research, such as domain adaptation, real-time implementation, and the synergistic fusion of depth information with sensor technologies, thereby providing a valuable starting point for researchers to engage with and advance the field toward clinical translation.
♻ ☆ Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield ''visible but unreadable'' stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
♻ ☆ Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation
AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
comment: APSIPA ASC 2025 Accepted
♻ ☆ Position: The Artificial Intelligence and Machine Learning Community Should Adopt a More Transparent and Regulated Peer Review Process ICML 2025
The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning (ML) conferences has prompted many venues to transition from closed to open review platforms. Some have fully embraced open peer reviews, allowing public visibility throughout the process, while others adopt hybrid approaches, such as releasing reviews only after final decisions or keeping reviews private despite using open peer review systems. In this work, we analyze the strengths and limitations of these models, highlighting the growing community interest in transparent peer review. To support this discussion, we examine insights from Paper Copilot, a website launched two years ago to aggregate and analyze AI / ML conference data while engaging a global audience. The site has attracted over 200,000 early-career researchers, particularly those aged 18-34 from 177 countries, many of whom are actively engaged in the peer review process. Drawing on our findings, this position paper advocates for a more transparent, open, and well-regulated peer review aiming to foster greater community involvement and propel advancements in the field.
comment: ICML 2025; https://papercopilot.com/
♻ ☆ Spatio-Temporal LLM: Reasoning about Environments and Actions
Despite significant recent progress of Multimodal Large Language Models (MLLMs), current MLLMs are challenged by "spatio-temporal" prompts, i.e., prompts that refer to 1) the entirety of an environment encoded in a point cloud that the MLLM should consider; and simultaneously also refer to 2) actions that happened in part of the environment and are encoded in a short ego-centric video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this challenge, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent MLLMs indeed struggle to correctly answer "spatio-temporal" prompts. Building on this dataset, we study two spatio-temporal LLM (STLLM) baselines: 1) STLLM-3D, which directly fuses point cloud, video, and text representations as inputs to the LLM; and 2) STLLM-Aligner, which aligns spatial context with video and text before LLM decoding. Both baselines aim to enhance spatial understanding of environments and temporal grounding of egocentric observations. On REA, the STLLM baselines outperform existing models, demonstrating the effectiveness of our designs. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
comment: Code and data are available at https://zoezheng126.github.io/STLLM-website/
♻ ☆ TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce \textbf{TempFlow-GRPO} (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces three key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases; and (iii) a seed group strategy that controls for initialization effects to isolate exploration contributions. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and text-to-image benchmarks.
♻ ☆ AquaCluster: Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation
In recent years, the wide availability of high-resolution radar satellite images has enabled the remote monitoring of wetland surface areas. Machine learning models have achieved state-of-the-art results in segmenting wetlands from satellite images. However, these models require large amounts of manually annotated satellite images, which are slow and expensive to produce. The need for annotated training data makes it difficult to adapt these models to changes such as different climates or sensors. To address this issue, we employed self-supervised training methods to develop a model, AquaCluster, which segments radar satellite images into water and land areas without manual annotations. Our final model outperformed other radar-based water detection techniques that do not require annotated data in our test dataset, having achieved a 0.08 improvement in the Intersection over Union metric. Our results demonstrate that it is possible to train machine learning models to detect vegetated water from radar images without the use of annotated data, which can make the retraining of these models to account for changes much easier.
comment: 19 pages, 6 figures
♻ ☆ Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like `bouba' with round shapes and `kiki' with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and we use Grad-CAM as a novel approach to interpret visual attention in shape-word matching tasks. Our findings show that these model variants do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both model variants lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
comment: Presented at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (2025)
♻ ☆ Improving Transferability of Adversarial Examples via Bayesian Attacks
The transferability of adversarial examples allows for the attack on unknown deep neural networks (DNNs), posing a serious threat to many applications and attracting great attention. In this paper, we improve the transferability of adversarial examples by incorporating the Bayesian formulation into both the model parameters and model input, enabling their joint diversification. We demonstrate that combination of Bayesian formulations for both the model input and model parameters yields significant improvements in transferability. By introducing advanced approximations of the posterior distribution over the model input, adversarial transferability achieves further enhancement, surpassing all state-of-the-arts when attacking without model fine-tuning. Additionally, we propose a principled approach to fine-tune model parameters within this Bayesian framework. Extensive experiments demonstrate that our method achieves a new state-of-the-art in transfer-based attacks, significantly improving the average success rate on ImageNet and CIFAR-10. Code at: https://github.com/qizhangli/MoreBayesian-jrnl.
comment: Accepted by TCSVT
♻ ☆ Semantically Guided Action Anticipation
Unsupervised domain adaptation remains a critical challenge in enabling the knowledge transfer of models across unseen domains. Existing methods struggle to balance the need for domain-invariant representations with preserving domain-specific features, which is often due to alignment approaches that impose the projection of samples with similar semantics close in the latent space despite their drastic domain differences. We introduce a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. Our method defines a domain-agnostic structure upon the semantic/geometric relationships between class labels in language space and guides adaptation, ensuring that the organization of samples in visual space reflects reference inter-class relationships while preserving domain-specific characteristics. We empirically demonstrate our method's superiority in domain adaptation tasks across four diverse image and video datasets. Remarkably, we surpass previous works in 18 different adaptation scenarios across four diverse image and video datasets with average accuracy improvements of +3.32% on DomainNet, +5.75% in GeoPlaces, +4.77% on GeoImnet, and +1.94% mean class accuracy improvement on EgoExo4D.
♻ ☆ Identifying Hard Noise in Long-Tailed Sample Distribution ECCV2022
Conventional de-noising methods rely on the assumption that all samples are independent and identically distributed, so the resultant classifier, though disturbed by noise, can still easily identify the noises as the outliers of training distribution. However, the assumption is unrealistic in large-scale data that is inevitably long-tailed. Such imbalanced training data makes a classifier less discriminative for the tail classes, whose previously "easy" noises are now turned into "hard" ones -- they are almost as outliers as the clean tail samples. We introduce this new challenge as Noisy Long-Tailed Classification (NLT). Not surprisingly, we find that most de-noising methods fail to identify the hard noises, resulting in significant performance drop on the three proposed NLT benchmarks: ImageNet-NLT, Animal10-NLT, and Food101-NLT. To this end, we design an iterative noisy learning framework called Hard-to-Easy (H2E). Our bootstrapping philosophy is to first learn a classifier as noise identifier invariant to the class and context distributional changes, reducing "hard" noises to "easy" ones, whose removal further improves the invariance. Experimental results show that our H2E outperforms state-of-the-art de-noising methods and their ablations on long-tailed settings while maintaining a stable performance on the conventional balanced settings. Datasets and codes are available at https://github.com/yxymessi/H2E-Framework
comment: Accepted to ECCV2022(Oral) ; Datasets and codes are available at https://github.com/yxymessi/H2E-Framework
♻ ☆ MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at https://github.com/mbzuai-oryx/MATRIX.
comment: https://mbzuai-oryx.github.io/MATRIX/
♻ ☆ Towards Generalized Video Quality Assessment: A Weak-to-Strong Learning Paradigm
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets, which, despite substantial progress, still suffers from poor generalization to unseen video content. Moreover, its reliance on human annotations -- which are labor-intensive and costly -- makes it difficult to scale datasets for improving model generalization. In this work, we explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on large-scale human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy allows a strong student model to not only match its weak teacher on in-domain benchmarks but also surpass it on out-of-distribution (OOD) benchmarks, revealing a distinct weak-to-strong effect in VQA. Building on this insight, we propose a novel framework that enhances W2S learning from two aspects: (1) integrating homogeneous and heterogeneous supervision signals from diverse VQA teachers -- including off-the-shelf VQA models and synthetic distortion simulators -- via a learn-to-rank formulation, and (2) iterative W2S training, where each strong student is recycled as the teacher in subsequent cycles, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results across both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to break annotation barriers and achieve scalable generalization in VQA, with implications extending to broader alignment and evaluation tasks.
♻ ☆ Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease
Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on ADNI with only 1,504 training MRIs, outperforming methods trained on 27,161 MRIs, and shows strong zero-shot generalization on OASIS-2 and AIBL. Code is available at https://github.com/CFQ666312/DEFT-VLM-AD.
comment: Accepted at MICAD 2025
♻ ☆ Axis-level Symmetry Detection with Group-Equivariant Representation ICCV 2025
Symmetry is a fundamental concept that has been extensively studied, yet detecting it in complex scenes remains a significant challenge in computer vision. Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. In this work, we propose a novel framework for axis-level detection of the two most common symmetry types-reflection and rotation-by representing them as explicit geometric primitives, i.e. lines and points. Our method employs a dual-branch architecture that is equivariant to the dihedral group, with each branch specialized to exploit the structure of dihedral group-equivariant features for its respective symmetry type. For reflection symmetry, we introduce orientational anchors, aligned with group components, to enable orientation-specific detection, and a reflectional matching that measures similarity between patterns and their mirrored counterparts across candidate axes. For rotational symmetry, we propose a rotational matching that compares patterns at fixed angular intervals to identify rotational centers. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches.
comment: Accepted to ICCV 2025
♻ ☆ GranQ: Efficient Channel-wise Quantization via Vectorized Pre-Scaling for Zero-Shot QAT
Zero-shot quantization (ZSQ) enables neural network compression without original training data, making it a promising solution for restricted data access scenarios. To compensate for the lack of data, recent ZSQ methods typically rely on synthetic inputs generated from the full-precision model. However, these synthetic inputs often lead to activation distortion, especially under low-bit settings. To mitigate this, existing methods typically employ per-channel scaling, but they still struggle due to the severe computational overhead during the accumulation process. To overcome this critical bottleneck, we propose GranQ, a novel activation quantization framework that introduces an efficient pre-scaling strategy. Unlike conventional channel-wise methods that repeatedly perform scaling operations during accumulation, GranQ applies scaling factors in a pre-scaling step through fully vectorized computation, eliminating runtime scaling overhead. This design enables GranQ to maintain fine-grained quantization accuracy while significantly reducing computational burden, particularly in low-bit quantization settings. Extensive experiments under quantization-aware training (QAT) settings demonstrate that GranQ consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet. In particular, our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10.
♻ ☆ RealEngine: Simulating Autonomous Driving in Realistic Context
Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high-quality driving simulator must satisfy several key requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed-loop evaluation to support free-form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi-agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed-loop simulation in the driving context. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real-world performance of driving agents.
♻ ☆ GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity NeurIPS 2025
Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
comment: NeurIPS 2025
♻ ☆ SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking
Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%)-even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking) by simply scaling images to low resolutions (32-128 pixels), which unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
♻ ☆ ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
Dense audio-visual event localization (DAVE) aims to identify event categories and locate the temporal boundaries in untrimmed videos. Most studies only employ event-related semantic constraints on the final outputs, lacking cross-modal semantic bridging in intermediate layers. This causes modality semantic gap for further fusion, making it difficult to distinguish between event-related content and irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model to infer concurrent events among complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our eventaware semantic guided network (ESG-Net) includes a early semantics interaction (ESI) module and a mixture of dependency experts (MoDE) module. ESI applys multi-stage semantic guidance to explicitly constrain the model in learning semantic information through multi-modal early fusion and several classification loss functions, ensuring hierarchical understanding of event-related content. MoDE promotes the extraction of multi-event dependencies through multiple serial mixture of experts with adaptive weight allocation. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods, while greatly reducing parameters and computational load. Our code will be released on https://github.com/uchiha99999/ESG-Net.
♻ ☆ Diffusion-Classifier Synergy: Reward-Aligned Learning via Mutual Boosting Loop for FSCIL NeurIPS 2025
Few-Shot Class-Incremental Learning (FSCIL) challenges models to sequentially learn new classes from minimal examples without forgetting prior knowledge, a task complicated by the stability-plasticity dilemma and data scarcity. Current FSCIL methods often struggle with generalization due to their reliance on limited datasets. While diffusion models offer a path for data augmentation, their direct application can lead to semantic misalignment or ineffective guidance. This paper introduces Diffusion-Classifier Synergy (DCS), a novel framework that establishes a mutual boosting loop between diffusion model and FSCIL classifier. DCS utilizes a reward-aligned learning strategy, where a dynamic, multi-faceted reward function derived from the classifier's state directs the diffusion model. This reward system operates at two levels: the feature level ensures semantic coherence and diversity using prototype-anchored maximum mean discrepancy and dimension-wise variance matching, while the logits level promotes exploratory image generation and enhances inter-class discriminability through confidence recalibration and cross-session confusion-aware mechanisms. This co-evolutionary process, where generated images refine the classifier and an improved classifier state yields better reward signals, demonstrably achieves state-of-the-art performance on FSCIL benchmarks, significantly enhancing both knowledge retention and new class learning.
comment: Accepted by NeurIPS 2025
♻ ☆ MIRROR: Multimodal Cognitive Reframing Therapy for Rolling with Resistance
Recent studies have explored the use of large language models (LLMs) in psychotherapy; however, text-based cognitive behavioral therapy (CBT) models often struggle with client resistance, which can weaken therapeutic alliance. To address this, we propose a multimodal approach that incorporates nonverbal cues, which allows the AI therapist to better align its responses with the client's negative emotional state. Specifically, we introduce a new synthetic dataset, Mirror (Multimodal Interactive Rolling with Resistance), which is a novel synthetic dataset that pairs each client's statements with corresponding facial images. Using this dataset, we train baseline vision language models (VLMs) so that they can analyze facial cues, infer emotions, and generate empathetic responses to effectively manage client resistance. These models are then evaluated in terms of both their counseling skills as a therapist, and the strength of therapeutic alliance in the presence of client resistance. Our results demonstrate that Mirror significantly enhances the AI therapist's ability to handle resistance, which outperforms existing text-based CBT approaches. Human expert evaluations further confirm the effectiveness of our approach in managing client resistance and fostering therapeutic alliance.
comment: EMNLP 2025 Main
♻ ☆ TMT: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation
Recent advances in Vision Transformers (ViTs) have significantly advanced semantic segmentation performance. However, their adaptation to new target domains remains challenged by distribution shifts, which often disrupt global attention mechanisms. While existing global and patch-level adaptation methods offer some improvements, they overlook the spatially varying transferability inherent in different image regions. To address this, we propose the Transferable Mask Transformer (TMT), a region-adaptive framework designed to enhance cross-domain representation learning through transferability guidance. First, we dynamically partition the image into coherent regions, grouped by structural and semantic similarity, and estimates their domain transferability at a localized level. Then, we incorporate region-level transferability maps directly into the self-attention mechanism of ViTs, allowing the model to adaptively focus attention on areas with lower transferability and higher semantic uncertainty. Extensive experiments across 20 diverse cross-domain settings demonstrate that TMT not only mitigates the performance degradation typically associated with domain shift but also consistently outperforms existing approaches.
♻ ☆ Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection
While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on `in-the-wild' benchmarks. Instead of crafting another specialized `knife' for this problem, we bring a `gun' to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively `outguns' bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20\%. Our analysis pinpoints the source of the VFM's `firepower': First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., `AI-generated'), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM's pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world `gunfight' of AI-generated image detection, the raw `firepower' of an updated VFM is far more effective than the `craftsmanship' of a static detector. 2) True generalization evaluation requires test data to be independent of the model's entire training history, including pre-training.
♻ ☆ DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
MLLMs have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of RL in both LLMs and MLLMs, in this paper, we explore how RL can enhance visual perception ability of MLLMs. Then we develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1) designed to enhance the visual perception capabilities of MLLMs, by comprehending complex scenes and looking through visual instances closely. DIP-R1 guides MLLMs through detailed inspection of visual scene via three simply designed rule-based reward modeling. First, we adopt a standard reasoning reward encouraging the model to include three-step reasoning process: 1) comprehending entire visual scene, 2) observing for looking through interested but ambiguous regions, and 3) decision-making for predicting answer. Second, a variance-guided looking reward is designed to encourage MLLM to examine uncertain regions during the observing process, guiding it to inspect ambiguous areas and mitigate perceptual uncertainty. This reward promotes variance-driven visual exploration, enabling MLLM to reason about region-level uncertainty and explicitly indicate interpretable uncertain regions. Third, we model a weighted precision-recall accuracy reward enhancing accurate decision-making. We verify its effectiveness across diverse fine-grained object detection data consisting of challenging real-world scenes, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvement across various in-domain and out-of-domain scenarios, outperforming various existing baselines and SFT method. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
♻ ☆ IterMask3D: Unsupervised Anomaly Detection and Segmentation with Test-Time Iterative Mask Refinement in 3D Brain MR
Unsupervised anomaly detection and segmentation methods train a model to learn the training distribution as `normal'. In the testing phase, they identify patterns that deviate from this normal distribution as `anomalies'. To learn the `normal' distribution, prevailing methods corrupt the images and train a model to reconstruct them. During testing, the model attempts to reconstruct corrupted inputs based on the learned `normal' distribution. Deviations from this distribution lead to high reconstruction errors, which indicate potential anomalies. However, corrupting an input image inevitably causes information loss even in normal regions, leading to suboptimal reconstruction and an increased risk of false positives. To alleviate this, we propose $\rm{IterMask3D}$, an iterative spatial mask-refining strategy designed for 3D brain MRI. We iteratively spatially mask areas of the image as corruption and reconstruct them, then shrink the mask based on reconstruction error. This process iteratively unmasks `normal' areas to the model, whose information further guides reconstruction of `normal' patterns under the mask to be reconstructed accurately, reducing false positives. In addition, to achieve better reconstruction performance, we also propose using high-frequency image content as additional structural information to guide the reconstruction of the masked area. Extensive experiments on the detection of both synthetic and real-world imaging artifacts, as well as segmentation of various pathological lesions across multiple MRI sequences, consistently demonstrate the effectiveness of our proposed method. Code is available at https://github.com/ZiyunLiang/IterMask3D.
comment: Published in Medical Image Analysis
♻ ☆ FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
With the increasing awareness of health and the growing desire for aesthetic physique, fitness has become a prevailing trend. However, the potential risks associated with fitness training, especially with weight-loaded fitness actions, cannot be overlooked. Action Quality Assessment (AQA), a technology that quantifies the quality of human action and provides feedback, holds the potential to assist fitness enthusiasts of varying skill levels in achieving better training outcomes. Nevertheless, current AQA methodologies and datasets are limited to single-view competitive sports scenarios and RGB modality and lack professional assessment and guidance of fitness actions. To address this gap, we propose the FLEX dataset, the first multi-modal, multi-action, large-scale dataset that incorporates surface electromyography (sEMG) signals into AQA. FLEX utilizes high-precision MoCap to collect 20 different weight-loaded actions performed by 38 subjects across 3 different skill levels for 10 repetitions each, containing 5 different views of the RGB video, 3D pose, sEMG, and physiological information. Additionally, FLEX incorporates knowledge graphs into AQA, constructing annotation rules in the form of penalty functions that map weight-loaded actions, action keysteps, error types, and feedback. We conducted various baseline methodologies on FLEX, demonstrating that multimodal data, multiview data, and fine-grained annotations significantly enhance model performance. FLEX not only advances AQA methodologies and datasets towards multi-modal and multi-action scenarios but also fosters the integration of artificial intelligence within the fitness domain. Dataset and code are available at https://haoyin116.github.io/FLEX_Dataset.
♻ ☆ HUMOTO: A 4D Dataset of Mocap Human Object Interactions ICCV 2025
We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 735 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: https://jiaxin-lu.github.io/humoto/ .
comment: ICCV 2025, 19 pages, 15 figures
♻ ☆ Jigsaw++: Imagining Complete Shape Priors for Object Reassembly ICCV 2025
The automatic assembly problem has attracted increasing interest due to its complex challenges that involve 3D representation. This paper introduces Jigsaw++, a novel generative method designed to tackle the multifaceted challenges of reconstructing complete shape for the reassembly problem. Existing approach focusing primarily on piecewise information for both part and fracture assembly, often overlooking the integration of complete object prior. Jigsaw++ distinguishes itself by learning a shape prior of complete objects. It employs the proposed "retargeting" strategy that effectively leverages the output of any existing assembly method to generate complete shape reconstructions. This capability allows it to function orthogonally to the current methods. Through extensive evaluations on Breaking Bad dataset and PartNet, Jigsaw++ has demonstrated its effectiveness, reducing reconstruction errors and enhancing the precision of shape reconstruction, which sets a new direction for future reassembly model developments.
comment: ICCV 2025, 21 pages, 10 figures
♻ ☆ VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis
We present VoxelPrompt, an end-to-end image analysis agent that tackles free-form radiological tasks. Given any number of volumetric medical images and a natural language prompt, VoxelPrompt integrates a language model that generates executable code to invoke a jointly-trained, adaptable vision network. This code further carries out analytical steps to address practical quantitative aims, such as measuring the growth of a tumor across visits. The pipelines generated by VoxelPrompt automate analyses that currently require practitioners to painstakingly combine multiple specialized vision and statistical tools. We evaluate VoxelPrompt using diverse neuroimaging tasks and show that it can delineate hundreds of anatomical and pathological features, measure complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt performs these objectives with an accuracy similar to that of specialist single-task models for image analysis, while facilitating a broad range of compositional biomedical workflows.
comment: 22 pages, vision-language agent, medical image analysis, neuroimage foundation model
♻ ☆ CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization ICCV 2025
In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.
comment: ICCV 2025, Project Page: https://cl-splats.github.io
♻ ☆ KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in https://github.com/XingruiWang/KeyVID.
♻ ☆ MAPS: Masked Attribution-based Probing of Strategies- A computational framework to align human and model explanations
Human core object recognition depends on the selective use of visual information, but the strategies guiding these choices are difficult to measure directly. We present MAPS (Masked Attribution-based Probing of Strategies), a behaviorally validated computational tool that tests whether explanations derived from artificial neural networks (ANNs) can also explain human vision. MAPS converts attribution maps into explanation-masked images (EMIs) and compares image-by-image human accuracies on these minimal images with limited pixel budgets with accuracies on the full stimuli. MAPS provides a principled way to evaluate and choose among competing ANN interpretability methods. In silico, EMI-based behavioral similarity between models reliably recovers the ground-truth similarity computed from their attribution maps, establishing which explanation methods best capture the model's strategy. When applied to humans and macaques, MAPS identifies ANN-explanation combinations whose explanations align most closely with biological vision, achieving the behavioral validity of Bubble masks while requiring far fewer behavioral trials. Because it needs only access to model attributions and a modest set of behavioral data on the original images, MAPS avoids exhaustive psychophysics while offering a scalable tool for adjudicating explanations and linking human behavior, neural activity, and model decisions under a common standard.
♻ ☆ Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
Earth observation (EO) in open-world settings presents a unique challenge: different applications rely on diverse sensor modalities, each with varying ground sampling distances, spectral ranges, and numbers of spectral bands. However, existing EO foundation models are typically tailored to specific sensor types, making them inflexible when generalizing across the heterogeneous landscape of EO data. To address this, we propose the Dynamic One-For-All (DOFA) model, a unified, multimodal foundation framework designed for diverse vision tasks in EO. Inspired by neural plasticity, DOFA utilizes a wavelength-conditioned dynamic hypernetwork to process inputs from five distinct satellite sensors flexibly. By continually pretraining on five EO modalities, DOFA achieves state-of-the-art performance across multiple downstream tasks and generalizes well to unseen modalities. Enhanced with hybrid continual pretraining, DOFA+ requires significantly fewer computational resources while outperforming counterparts trained with extensive GPU budgets. Experiments on diverse datasets highlight DOFA's potential as a foundation for general-purpose vision models in the sensor-diverse EO domain. The code and pre-trained weights are publicly available at https://github.com/zhu-xlab/DOFA.
comment: 18 pages, 8 figures
♻ ☆ Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression
Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
♻ ☆ Real-Time Motion-Controllable Autoregressive Video Diffusion
Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.
♻ ☆ MetaQAP - A Meta-Learning Approach for Quality-Aware Pretraining in Image Quality Assessment
Image Quality Assessment (IQA) is a critical task in a wide range of applications but remains challenging due to the subjective nature of human perception and the complexity of real-world image distortions. This study proposes MetaQAP, a novel no-reference IQA model designed to address these challenges by leveraging quality-aware pre-training and meta-learning. The model performs three key contributions: pre-training Convolutional Neural Networks (CNNs) on a quality-aware dataset, implementing a quality-aware loss function to optimize predictions, and integrating a meta-learner to form an ensemble model that effectively combines predictions from multiple base models. Experimental evaluations were conducted on three benchmark datasets: LiveCD, KonIQ-10K, and BIQ2021. The proposed MetaQAP model achieved exceptional performance with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC) scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on KonIQ-10K, and 0.884/0.8765 on BIQ2021, outperforming existing IQA methods. Cross-dataset evaluations further demonstrated the generalizability of the model, with PLCC and SROCC scores ranging from 0.6721 to 0.8023 and 0.6515 to 0.7805, respectively, across diverse datasets. The ablation study confirmed the significance of each model component, revealing substantial performance degradation when critical elements such as the meta-learner or quality-aware loss function were omitted. MetaQAP not only addresses the complexities of authentic distortions but also establishes a robust and generalizable framework for practical IQA applications. By advancing the state-of-the-art in no-reference IQA, this research provides valuable insights and methodologies for future improvements and extensions in the field.
♻ ☆ Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline
In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.
Machine Learning 276
☆ MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control
MimicKit is an open-source framework for training motion controllers using motion imitation and reinforcement learning. The codebase provides implementations of commonly-used motion-imitation techniques and RL algorithms. This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures. The codebase is designed to be modular and easily configurable, enabling convenient modification and extension to new characters and tasks. The open-source codebase is available at: https://github.com/xbpeng/MimicKit.
☆ NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models
With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose , a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.
comment: code available at: https://github.com/nirgoren/NoisePrints
☆ Provably Invincible Adversarial Attacks on Reinforcement Learning Systems: A Rate-Distortion Information-Theoretic Approach
Reinforcement learning (RL) for the Markov Decision Process (MDP) has emerged in many security-related applications, such as autonomous driving, financial decisions, and drone/robot algorithms. In order to improve the robustness/defense of RL systems against adversaries, studying various adversarial attacks on RL systems is very important. Most previous work considered deterministic adversarial attack strategies in MDP, which the recipient (victim) agent can defeat by reversing the deterministic attacks. In this paper, we propose a provably ``invincible'' or ``uncounterable'' type of adversarial attack on RL. The attackers apply a rate-distortion information-theoretic approach to randomly change agents' observations of the transition kernel (or other properties) so that the agent gains zero or very limited information about the ground-truth kernel (or other properties) during the training. We derive an information-theoretic lower bound on the recipient agent's reward regret and show the impact of rate-distortion attacks on state-of-the-art model-based and model-free algorithms. We also extend this notion of an information-theoretic approach to other types of adversarial attack, such as state observation attacks.
☆ T3former: Temporal Graph Classification with Topological Machine Learning
Temporal graph classification plays a critical role in applications such as cybersecurity, brain connectivity analysis, social dynamics, and traffic monitoring. Despite its significance, this problem remains underexplored compared to temporal link prediction or node forecasting. Existing methods often rely on snapshot-based or recurrent architectures that either lose fine-grained temporal information or struggle with long-range dependencies. Moreover, local message-passing approaches suffer from oversmoothing and oversquashing, limiting their ability to capture complex temporal structures. We introduce T3former, a novel Topological Temporal Transformer that leverages sliding-window topological and spectral descriptors as first-class tokens, integrated via a specialized Descriptor-Attention mechanism. This design preserves temporal fidelity, enhances robustness, and enables principled cross-modal fusion without rigid discretization. T3former achieves state-of-the-art performance across multiple benchmarks, including dynamic social networks, brain functional connectivity datasets, and traffic networks. It also offers theoretical guarantees of stability under temporal and structural perturbations. Our results highlight the power of combining topological and spectral insights for advancing the frontier of temporal graph learning.
comment: 14 pages, 8 figures
☆ The Art of Scaling Reinforcement Learning Compute for LLMs
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
comment: 28 pages, 20 figures
☆ UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations
Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at https://github.com/DominikM198/UrbanFusion.
☆ Tensor Gaussian Processes: Efficient Solvers for Nonlinear PDEs
Machine learning solvers for partial differential equations (PDEs) have attracted growing interest. However, most existing approaches, such as neural network solvers, rely on stochastic training, which is inefficient and typically requires a great many training epochs. Gaussian process (GP)/kernel-based solvers, while mathematical principled, suffer from scalability issues when handling large numbers of collocation points often needed for challenging or higher-dimensional PDEs. To overcome these limitations, we propose TGPS, a tensor-GP-based solver that models factor functions along each input dimension using one-dimensional GPs and combines them via tensor decomposition to approximate the full solution. This design reduces the task to learning a collection of one-dimensional GPs, substantially lowering computational complexity, and enabling scalability to massive collocation sets. For efficient nonlinear PDE solving, we use a partial freezing strategy and Newton's method to linerize the nonlinear terms. We then develop an alternating least squares (ALS) approach that admits closed-form updates, thereby substantially enhancing the training efficiency. We establish theoretical guarantees on the expressivity of our model, together with convergence proof and error analysis under standard regularity assumptions. Experiments on several benchmark PDEs demonstrate that our method achieves superior accuracy and efficiency compared to existing approaches.
☆ PriorGuide: Test-Time Prior Adaptation for Simulation-Based Inference
Amortized simulator-based inference offers a powerful framework for tackling Bayesian inference in computational fields such as engineering or neuroscience, increasingly leveraging modern generative methods like diffusion models to map observed data to model parameters or future predictions. These approaches yield posterior or posterior-predictive samples for new datasets without requiring further simulator calls after training on simulated parameter-data pairs. However, their applicability is often limited by the prior distribution(s) used to generate model parameters during this training phase. To overcome this constraint, we introduce PriorGuide, a technique specifically designed for diffusion-based amortized inference methods. PriorGuide leverages a novel guidance approximation that enables flexible adaptation of the trained diffusion model to new priors at test time, crucially without costly retraining. This allows users to readily incorporate updated information or expert knowledge post-training, enhancing the versatility of pre-trained inference models.
comment: 35 pages, 6 figures
☆ Progressive multi-fidelity learning for physical system predictions
Highly accurate datasets from numerical or physical experiments are often expensive and time-consuming to acquire, posing a significant challenge for applications that require precise evaluations, potentially across multiple scenarios and in real-time. Even building sufficiently accurate surrogate models can be extremely challenging with limited high-fidelity data. Conversely, less expensive, low-fidelity data can be computed more easily and encompass a broader range of scenarios. By leveraging multi-fidelity information, prediction capabilities of surrogates can be improved. However, in practical situations, data may be different in types, come from sources of different modalities, and not be concurrently available, further complicating the modeling process. To address these challenges, we introduce a progressive multi-fidelity surrogate model. This model can sequentially incorporate diverse data types using tailored encoders. Multi-fidelity regression from the encoded inputs to the target quantities of interest is then performed using neural networks. Input information progressively flows from lower to higher fidelity levels through two sets of connections: concatenations among all the encoded inputs, and additive connections among the final outputs. This dual connection system enables the model to exploit correlations among different datasets while ensuring that each level makes an additive correction to the previous level without altering it. This approach prevents performance degradation as new input data are integrated into the model and automatically adapts predictions based on the available inputs. We demonstrate the effectiveness of the approach on numerical benchmarks and a real-world case study, showing that it reliably integrates multi-modal data and provides accurate predictions, maintaining performance when generalizing across time and parameter variations.
☆ A Complete Pipeline for deploying SNNs with Synaptic Delays on Loihi 2
Spiking Neural Networks are attracting increased attention as a more energy-efficient alternative to traditional Artificial Neural Networks for edge computing. Neuromorphic computing can significantly reduce energy requirements. Here, we present a complete pipeline: efficient event-based training of SNNs with synaptic delays on GPUs and deployment on Intel's Loihi 2 neuromorphic chip. We evaluate our approach on keyword recognition tasks using the Spiking Heidelberg Digits and Spiking Speech Commands datasets, demonstrating that our algorithm can enhance classification accuracy compared to architectures without delays. Our benchmarking indicates almost no accuracy loss between GPU and Loihi 2 implementations, while classification on Loihi 2 is up to 18x faster and uses 250x less energy than on an NVIDIA Jetson Orin Nano.
☆ RECODE: Reasoning Through Code Generation for Visual Question Answering
Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
☆ Optimal Bounds for Tyler's M-Estimator for Elliptical Distributions
A fundamental problem in statistics is estimating the shape matrix of an Elliptical distribution. This generalizes the familiar problem of Gaussian covariance estimation, for which the sample covariance achieves optimal estimation error. For Elliptical distributions, Tyler proposed a natural M-estimator and showed strong statistical properties in the asymptotic regime, independent of the underlying distribution. Numerical experiments show that this estimator performs very well, and that Tyler's iterative procedure converges quickly to the estimator. Franks and Moitra recently provided the first distribution-free error bounds in the finite sample setting, as well as the first rigorous convergence analysis of Tyler's iterative procedure. However, their results exceed the sample complexity of the Gaussian setting by a $\log^{2} d$ factor. We close this gap by proving optimal sample threshold and error bounds for Tyler's M-estimator for all Elliptical distributions, fully matching the Gaussian result. Moreover, we recover the algorithmic convergence even at this lower sample threshold. Our approach builds on the operator scaling connection of Franks and Moitra by introducing a novel pseudorandom condition, which we call $\infty$-expansion. We show that Elliptical distributions satisfy $\infty$-expansion at the optimal sample threshold, and then prove a novel scaling result for inputs satisfying this condition.
comment: 13 pages + proofs in Appendix
☆ Asymptotically optimal reinforcement learning in Block Markov Decision Processes
The curse of dimensionality renders Reinforcement Learning (RL) impractical in many real-world settings with exponentially large state and action spaces. Yet, many environments exhibit exploitable structure that can accelerate learning. To formalize this idea, we study RL in Block Markov Decision Processes (BMDPs). BMDPs model problems with large observation spaces, but where transition dynamics are fully determined by latent states. Recent advances in clustering methods have enabled the efficient recovery of this latent structure. However, a regret analysis that exploits these techniques to determine their impact on learning performance remained open. We are now addressing this gap by providing a regret analysis that explicitly leverages clustering, demonstrating that accurate latent state estimation can indeed effectively speed up learning. Concretely, this paper analyzes a two-phase RL algorithm for BMDPs that first learns the latent structure through random exploration and then switches to an optimism-guided strategy adapted to the uncovered structure. This algorithm achieves a regret that is $O(\sqrt{T}+n)$ on a large class of BMDPs susceptible to clustering. Here, $T$ denotes the number of time steps, $n$ is the cardinality of the observation space, and the Landau notation $O(\cdot)$ holds up to constants and polylogarithmic factors. This improves the best prior bound, $O(\sqrt{T}+n^2)$, especially when $n$ is large. Moreover, we prove that no algorithm can achieve lower regret uniformly on this same class of BMDPs. This establishes that, on this class, the algorithm achieves asymptotic optimality.
comment: 74 pages, 3 figures
☆ Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently supported. To train LLM-based reasoners in such challenging, open-ended settings, strong verifiers capable of catching step-level mistakes are necessary prerequisites. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: Verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed source models. We subsequently analyze what drives poor performance in step-level verification, the impacts of scaling verifier compute, as well as fundamental questions such as self-verification and verification-generation dynamics.
comment: 21 pages, 8 figures, 5 tables
☆ Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs
Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over-squashing and missing multiple connections to gain the same information that could be gained from a long-range link. Through this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC) to enhance performance by limiting the number of long-range links. To this end, we propose LogViG, a novel hybrid CNN-GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce and apply a high-resolution branch and fuse features between our high-resolution and low-resolution branches for a multi-scale high-resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at https://github.com/mmunir127/LogViG-Official.
comment: Published in the Proceedings of the Third Learning on Graphs Conference (LoG 2024)
☆ Assessing the Geographic Generalization and Physical Consistency of Generative Models for Climate Downscaling
Kilometer-scale weather data is crucial for real-world applications but remains computationally intensive to produce using traditional weather simulations. An emerging solution is to use deep learning models, which offer a faster alternative for climate downscaling. However, their reliability is still in question, as they are often evaluated using standard machine learning metrics rather than insights from atmospheric and weather physics. This paper benchmarks recent state-of-the-art deep learning models and introduces physics-inspired diagnostics to evaluate their performance and reliability, with a particular focus on geographic generalization and physical consistency. Our experiments show that, despite the seemingly strong performance of models such as CorrDiff, when trained on a limited set of European geographies (e.g., central Europe), they struggle to generalize to other regions such as Iberia, Morocco in the south, or Scandinavia in the north. They also fail to accurately capture second-order variables such as divergence and vorticity derived from predicted velocity fields. These deficiencies appear even in in-distribution geographies, indicating challenges in producing physically consistent predictions. We propose a simple initial solution: introducing a power spectral density loss function that empirically improves geographic generalization by encouraging the reconstruction of small-scale physical structures. The code for reproducing the experimental results can be found at https://github.com/CarloSaccardi/PSD-Downscaling
☆ Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe
Pruning is a common technique to reduce the compute and storage requirements of Neural Networks. While conventional approaches typically retrain the model to recover pruning-induced performance degradation, state-of-the-art Large Language Model (LLM) pruning methods operate layer-wise, minimizing the per-layer pruning error on a small calibration dataset to avoid full retraining, which is considered computationally prohibitive for LLMs. However, finding the optimal pruning mask is a hard combinatorial problem and solving it to optimality is intractable. Existing methods hence rely on greedy heuristics that ignore the weight interactions in the pruning objective. In this work, we instead consider the convex relaxation of these combinatorial constraints and solve the resulting problem using the Frank-Wolfe (FW) algorithm. Our method drastically reduces the per-layer pruning error, outperforms strong baselines on state-of-the-art GPT architectures, and remains memory-efficient. We provide theoretical justification by showing that, combined with the convergence guarantees of the FW algorithm, we obtain an approximate solution to the original combinatorial problem upon rounding the relaxed solution to integrality.
☆ Dedelayed: Deleting remote inference delay via on-device correction
Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.
☆ Training LLM Agents to Empower Humans
Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human attain their objectives. Additionally, these methods often require costly explicit human feedback to provide a training signal. We propose a new approach to tuning assistive language models based on maximizing the human's empowerment, their ability to effect desired changes in the environment. Our empowerment-maximizing method, Empower, only requires offline text data, providing a self-supervised method for fine-tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18-person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate and 38% fewer suggestions. Additionally, we introduce a new environment for evaluating multi-turn code assistance using simulated humans. Using this environment, we show that agents trained with Empower increase the success rate of a simulated human programmer on challenging coding questions by an average of 192% over an SFT baseline. With this empowerment objective, we provide a framework for useful aligned AI agents at scale using only offline data without the need for any additional human feedback or verifiable rewards.
☆ Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
☆ On Pretraining for Project-Level Code Completion
Repository-level pretraining is commonly used to enable large language models for code to leverage codebase-wide context. This enhances their ability to generate accurate and context-aware code completions. In this work, we investigate how different repository-processing strategies affect in-context learning in OpenCoder, a 1.5B-parameter model. We extend its context window from 4,096 to 16,384 tokens by training on additional 1B tokens of curated repository-level data. Despite relying on a smaller dataset than competing models (which often use hundreds of billions of tokens), our model achieves comparable performance on the Long Code Arena benchmark. We find that various repository-processing techniques yield similarly strong results, with the primary gain coming from adapting to a new rotary positional embedding (RoPE) scaling parameter. Finally, we show that a simpler file-level training approach at the original sequence length remains highly effective, opening up repository-level code completion research to settings with more constrained data and compute resources.
☆ Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking-or reward over-optimization-remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM's IB latent space, measured by Mahalanobis distance from the SFT-induced distribution. Motivated by this, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective within the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying reward hacking severity, enabling principled hyperparameter tuning and online mitigation such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool-collectively advancing the state of RLHF.
comment: 46 pages, 36 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence
☆ Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
Diagonal preconditioners are computationally feasible approximate to second-order optimizers, which have shown significant promise in accelerating training of deep learning models. Two predominant approaches are based on Adam and Gauss-Newton (GN) methods: the former leverages statistics of current gradients and is the de-factor optimizers for neural networks, and the latter uses the diagonal elements of the Gauss-Newton matrix and underpins some of the recent diagonal optimizers such as Sophia. In this work, we compare these two diagonal preconditioning methods through the lens of two key factors: the choice of basis in the preconditioner, and the impact of gradient noise from mini-batching. To gain insights, we analyze these optimizers on quadratic objectives and logistic regression under all four quadrants. We show that regardless of the basis, there exist instances where Adam outperforms both GN$^{-1}$ and GN$^{-1/2}$ in full-batch settings. Conversely, in the stochastic regime, Adam behaves similarly to GN$^{-1/2}$ for linear regression under a Gaussian data assumption. These theoretical results are supported by empirical studies on both convex and non-convex objectives.
☆ Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning
Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. In this work, we propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.
☆ CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas
Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizer. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism--a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on Kinetics-600 dataset and rivals diffusion-based methods.
☆ Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference
Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) A lightweight and continuous LLM-native prediction method that leverages LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) A rescheduling solution in decode phase with : A dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24 times higher goodput.
☆ Axial Neural Networks for Dimension-Free Foundation Models
The advent of foundation models in AI has significantly advanced general-purpose learning, enabling remarkable capabilities in zero-shot inference and in-context learning. However, training such models on physics data, including solutions to partial differential equations (PDEs), poses a unique challenge due to varying dimensionalities across different systems. Traditional approaches either fix a maximum dimension or employ separate encoders for different dimensionalities, resulting in inefficiencies. To address this, we propose a dimension-agnostic neural network architecture, the Axial Neural Network (XNN), inspired by parameter-sharing structures such as Deep Sets and Graph Neural Networks. XNN generalizes across varying tensor dimensions while maintaining computational efficiency. We convert existing PDE foundation models into axial neural networks and evaluate their performance across three training scenarios: training from scratch, pretraining on multiple PDEs, and fine-tuning on a single PDE. Our experiments show that XNNs perform competitively with original models and exhibit superior generalization to unseen dimensions, highlighting the importance of multidimensional pretraining for foundation models.
☆ Time Series Foundation Models: Benchmarking Challenges and Requirements
Time Series Foundation Models (TSFMs) represent a new paradigm for time series forecasting, offering zero-shot forecasting capabilities without the need for domain-specific pre-training or fine-tuning. However, as with Large Language Models (LLMs), evaluating TSFMs is tricky, as with ever more extensive training sets, it becomes more and more challenging to ensure the integrity of benchmarking data. Our investigation of existing TSFM evaluation highlights multiple challenges, ranging from the representativeness of the benchmark datasets, over the lack of spatiotemporal evaluation, to risks of information leakage due to overlapping and obscure datasets, and the memorization of global patterns caused by external shocks like economic crises or pandemics. Our findings reveal widespread confusion regarding data partitions, risking inflated performance estimates and incorrect transfer of global knowledge to local time series. We argue for the development of robust evaluation methodologies to prevent pitfalls already observed in LLM and classical time series benchmarking, and call upon the research community to design new, principled approaches, such as evaluations on truly out-of-sample future data, to safeguard the integrity of TSFM assessment.
☆ What is the objective of reasoning with reinforcement learning?
We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
☆ Multivariate Time Series Forecasting with Gate-Based Quantum Reservoir Computing on NISQ Hardware
Quantum reservoir computing (QRC) offers a hardware-friendly approach to temporal learning, yet most studies target univariate signals and overlook near-term hardware constraints. This work introduces a gate-based QRC for multivariate time series (MTS-QRC) that pairs injection and memory qubits and uses a Trotterized nearest-neighbor transverse-field Ising evolution optimized for current device connectivity and depth. On Lorenz-63 and ENSO, the method achieves a mean square error (MSE) of 0.0087 and 0.0036, respectively, performing on par with classical reservoir computing on Lorenz and above learned RNNs on both, while NVAR and clustered ESN remain stronger on some settings. On IBM Heron R2, MTS-QRC sustains accuracy with realistic depths and, interestingly, outperforms a noiseless simulator on ENSO; singular value analysis indicates that device noise can concentrate variance in feature directions, acting as an implicit regularizer for linear readout in this regime. These findings support the practicality of gate-based QRC for MTS forecasting on NISQ hardware and motivate systematic studies on when and how hardware noise benefits QRC readouts.
☆ Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses
Accurate coding of tumor diagnoses with ICD-10-GM and ICD-O-3 is essential for structured cancer documentation in Germany. Smaller open-weight LLMs are appealing for privacy-preserving automation but often struggle with coding accuracy in German-language contexts. This study investigates whether instruction-based fine-tuning on public datasets improves the coding accuracy of open-weight LLMs for German tumor diagnosis texts. The evaluation uses coded diagnoses from the local tumor documentation system as test data. In a systematic data quality assessment, the upper limit for ICD-10 coding performance was estimated at 60-79% for exact and 81-94% for partial (three-character codes only) derivation. As training data, over 500,000 question-answer pairs were created based on the ICD-10-GM, ICD-O-3, and OPS catalogues. Eight open-weight models from the Qwen, Llama, and Mistral families (7-70 B parameters) were fine-tuned. ICD-10-GM accuracy rose from 1.4-24% to 41-58%, and partial accuracy from 31-74% to 73-83%. The accuracy of ICD-O-3 topography coding also improved but started and remained considerably lower with an exact accuracy of 22-40% and a partial accuracy of 56-67% after fine-tuning. Malformed code outputs dropped to 0% for all models. Tumor-diagnosis recognition reached 99%. Accuracy correlated positively with model size, but gaps between small and large models narrowed after fine-tuning. The reasoning mode in Qwen3 generally yielded a lower performance than fine-tuning and was over 100 times slower. Our findings highlight the potential of leveraging public catalogues to build instruction datasets that improve LLMs in medical documentation tasks. The complete training dataset and the best-performing checkpoints of the fine-tuned models are available from https://huggingface.co/datasets/stefan-m-lenz/ICDOPS-QA-2024.
comment: 19 pages, 4 figures
☆ Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings
Classical nonlinear dimensionality reduction (NLDR) techniques like t-SNE, Isomap, and LLE excel at creating low-dimensional embeddings for data visualization but fundamentally lack the ability to map these embeddings back to the original high-dimensional space. This one-way transformation limits their use in generative applications. This paper addresses this critical gap by introducing a system- atic framework for constructing neural decoder architectures for prominent NLDR methods, enabling bidirectional mapping for the first time. We extend this framework by implementing a diffusion-based generative process that operates directly within these learned manifold spaces. Through experiments on the CelebA dataset, we evaluate the reconstruction and generative performance of our approach against autoencoder and standard diffusion model baselines. Our findings reveal a fundamental trade- off: while the decoders successfully reconstruct data, their quality is surpassed by end-to-end optimized autoencoders. Moreover, manifold-constrained diffusion yields poor-quality samples, suggesting that the discrete and sparse nature of classical NLDR embeddings is ill-suited for the continuous inter- polation required by generative models. This work highlights the inherent challenges in retrofitting generative capabilities onto NLDR methods designed primarily for visualization and analysis.
☆ Message Passing on the Edge: Towards Scalable and Expressive GNNs
We propose EB-1WL, an edge-based color-refinement test, and a corresponding GNN architecture, EB-GNN. Our architecture is inspired by a classic triangle counting algorithm by Chiba and Nishizeki, and explicitly uses triangles during message passing. We achieve the following results: (1)~EB-1WL is significantly more expressive than 1-WL. Further, we provide a complete logical characterization of EB-1WL based on first-order logic, and matching distinguishability results based on homomorphism counting. (2)~In an important distinction from previous proposals for more expressive GNN architectures, EB-1WL and EB-GNN require near-linear time and memory on practical graph learning tasks. (3)~Empirically, we show that EB-GNN is a highly-efficient general-purpose architecture: It substantially outperforms simple MPNNs, and remains competitive with task-specialized GNNs while being significantly more computationally efficient.
☆ Towards Robust Knowledge Removal in Federated Learning with High Data Heterogeneity
Nowdays, there are an abundance of portable devices capable of collecting large amounts of data and with decent computational power. This opened the possibility to train AI models in a distributed manner, preserving the participating clients' privacy. However, because of privacy regulations and safety requirements, elimination upon necessity of a client contribution to the model has become mandatory. The cleansing process must satisfy specific efficacy and time requirements. In recent years, research efforts have produced several knowledge removal methods, but these require multiple communication rounds between the data holders and the process coordinator. This can cause the unavailability of an effective model up to the end of the removal process, which can result in a disservice to the system users. In this paper, we introduce an innovative solution based on Task Arithmetic and the Neural Tangent Kernel, to rapidly remove a client's influence from a model.
☆ NOSA: Native and Offloadable Sparse Attention
Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
comment: Preprint
☆ Physics-augmented Multi-task Gaussian Process for Modeling Spatiotemporal Dynamics
Recent advances in sensing and imaging technologies have enabled the collection of high-dimensional spatiotemporal data across complex geometric domains. However, effective modeling of such data remains challenging due to irregular spatial structures, rapid temporal dynamics, and the need to jointly predict multiple interrelated physical variables. This paper presents a physics-augmented multi-task Gaussian Process (P-M-GP) framework tailored for spatiotemporal dynamic systems. Specifically, we develop a geometry-aware, multi-task Gaussian Process (M-GP) model to effectively capture intrinsic spatiotemporal structure and inter-task dependencies. To further enhance the model fidelity and robustness, we incorporate governing physical laws through a physics-based regularization scheme, thereby constraining predictions to be consistent with governing dynamical principles. We validate the proposed P-M-GP framework on a 3D cardiac electrodynamics modeling task. Numerical experiments demonstrate that our method significantly improves prediction accuracy over existing methods by effectively incorporating domain-specific physical constraints and geometric prior.
comment: 13 pages, 5 figures
☆ EEGChaT: A Transformer-Based Modular Channel Selector for SEEG Analysis
Analyzing stereoelectroencephalography (SEEG) signals is critical for brain-computer interface (BCI) applications and neuroscience research, yet poses significant challenges due to the large number of input channels and their heterogeneous relevance. Traditional channel selection methods struggle to scale or provide meaningful interpretability for SEEG data. In this work, we propose EEGChaT, a novel Transformer-based channel selection module designed to automatically identify the most task-relevant channels in SEEG recordings. EEGChaT introduces Channel Aggregation Tokens (CATs) to aggregate information across channels, and leverages an improved Attention Rollout technique to compute interpretable, quantitative channel importance scores. We evaluate EEGChaT on the DuIN dataset, demonstrating that integrating EEGChaT with existing classification models consistently improves decoding accuracy, achieving up to 17\% absolute gains. Furthermore, the channel weights produced by EEGChaT show substantial overlap with manually selected channels, supporting the interpretability of the approach. Our results suggest that EEGChaT is an effective and generalizable solution for channel selection in high-dimensional SEEG analysis, offering both enhanced performance and insights into neural signal relevance.
☆ On the identifiability of causal graphs with multiple environments
Causal discovery from i.i.d. observational data is known to be generally ill-posed. We demonstrate that if we have access to the distribution of a structural causal model, and additional data from only two environments that sufficiently differ in the noise statistics, the unique causal graph is identifiable. Notably, this is the first result in the literature that guarantees the entire causal graph recovery with a constant number of environments and arbitrary nonlinear mechanisms. Our only constraint is the Gaussianity of the noise terms; however, we propose potential ways to relax this requirement. Of interest on its own, we expand on the well-known duality between independent component analysis (ICA) and causal discovery; recent advancements have shown that nonlinear ICA can be solved from multiple environments, at least as many as the number of sources: we show that the same can be achieved for causal discovery while having access to much less auxiliary information.
comment: Preprint
☆ ArtNet: Hierarchical Clustering-Based Artificial Netlist Generator for ML and DTCO Application
In advanced nodes, optimization of power, performance and area (PPA) has become highly complex and challenging. Machine learning (ML) and design-technology co-optimization (DTCO) provide promising mitigations, but face limitations due to a lack of diverse training data as well as long design flow turnaround times (TAT). We propose ArtNet, a novel artificial netlist generator designed to tackle these issues. Unlike previous methods, ArtNet replicates key topological characteristics, enhancing ML model generalization and supporting broader design space exploration for DTCO. By producing realistic artificial datasets that moreclosely match given target parameters, ArtNet enables more efficient PPAoptimization and exploration of flows and design enablements. In the context of CNN-based DRV prediction, ArtNet's data augmentationimproves F1 score by 0.16 compared to using only the original (real) dataset. In the DTCO context, ArtNet-generated mini-brains achieve a PPA match up to 97.94%, demonstrating close alignment with design metrics of targeted full-scale block designs.
☆ Selective Adversarial Attacks on LLM Benchmarks
Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized text attacks that affect many models equally, leaving open the question of whether it is possible to selectively degrade or enhance performance while minimally affecting other models. We formalize this problem and study selective adversarial attacks on MMLU - a widely used benchmark designed to measure a language model's broad general knowledge and reasoning ability across different subjects. Using canonical attacks integrated into TextAttack framework, we introduce a protocol for selectivity assessment, develop a custom constraint to increase selectivity of attacks and propose a surrogate-LLM pipeline that generates selective perturbations. Empirically, we find that selective adversarial attacks exist and can materially alter relative rankings, challenging the fairness, reproducibility, and transparency of leaderboard-driven evaluation. Our results motivate perturbation-aware reporting and robustness diagnostics for LLM evaluation and demonstrate that even subtle edits can shift comparative judgments.
☆ DOLFIN: Balancing Stability and Plasticity in Federated Continual Learning
Federated continual learning (FCL) enables models to learn new tasks across multiple distributed clients, protecting privacy and without forgetting previously acquired knowledge. However, current methods face challenges balancing performance, privacy preservation, and communication efficiency. We introduce a Distributed Online LoRA for Federated INcremental learning method DOLFIN, a novel approach combining Vision Transformers with low-rank adapters designed to efficiently and stably learn new tasks in federated environments. Our method leverages LoRA for minimal communication overhead and incorporates DualGradient Projection Memory (DualGPM) to prevent forgetting. Evaluated on CIFAR-100, ImageNet-R, ImageNet-A, and CUB-200 under two Dirichlet heterogeneity settings, DOLFIN consistently surpasses six strong baselines in final average accuracy while matching their memory footprint. Orthogonal low-rank adapters offer an effective and scalable solution for privacy-preserving continual learning in federated settings.
☆ Multi-Objective $\textit{min-max}$ Online Convex Optimization
In online convex optimization (OCO), a single loss function sequence is revealed over a time horizon of $T$, and an online algorithm has to choose its action at time $t$, before the loss function at time $t$ is revealed. The goal of the online algorithm is to incur minimal penalty (called $\textit{regret}$ compared to a static optimal action made by an optimal offline algorithm knowing all functions of the sequence in advance. In this paper, we broaden the horizon of OCO, and consider multi-objective OCO, where there are $K$ distinct loss function sequences, and an algorithm has to choose its action at time $t$, before the $K$ loss functions at time $t$ are revealed. To capture the tradeoff between tracking the $K$ different sequences, we consider the $\textit{min-max}$ regret, where the benchmark (optimal offline algorithm) takes a static action across all time slots that minimizes the maximum of the total loss (summed across time slots) incurred by each of the $K$ sequences. An online algorithm is allowed to change its action across time slots, and its {\it min-max} regret is defined as the difference between its $\textit{min-max}$ cost and that of the benchmark. The $\textit{min-max}$ regret is a stringent performance measure and an algorithm with small regret needs to `track' all loss function sequences closely at all times. We consider this $\textit{min-max}$ regret in the i.i.d. input setting where all loss functions are i.i.d. generated from an unknown distribution. For the i.i.d. model we propose a simple algorithm that combines the well-known $\textit{Hedge}$ and online gradient descent (OGD) and show via a remarkably simple proof that its expected $\textit{min-max}$ regret is $O(\sqrt{T \log K})$.
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
comment: 23 pages, 8 figures, 5 tables
☆ ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling
Topic modeling is a useful tool for analyzing large corpora of written documents, particularly academic papers. Despite a wide variety of proposed topic modeling techniques, these techniques do not perform well when applied to medical texts. This can be due to the low number of documents available for some topics in the healthcare domain. In this paper, we propose ProtoTopic, a prototypical network-based topic model used for topic generation for a set of medical paper abstracts. Prototypical networks are efficient, explainable models that make predictions by computing distances between input datapoints and a set of prototype representations, making them particularly effective in low-data or few-shot learning scenarios. With ProtoTopic, we demonstrate improved topic coherence and diversity compared to two topic modeling baselines used in the literature, demonstrating the ability of our model to generate medically relevant topics even with limited data.
☆ K-Merge: Online Continual Merging of Adapters for On-device Large Language Models
On-device deployment of Large Language Models (LLMs) frequently leverages Low-Rank Adapters (LoRAs) to support diverse downstream tasks under tight resource constraints. To address the limited storage capacity of mobile devices, recent works have explored model merging techniques to fuse multiple LoRAs into a single one. In practice, however, LoRAs are often delivered incrementally, as users request support for new tasks (e.g., novel problem types or languages). This scenario introduces a new challenge: on-device online continual merging, where the objective is to incorporate new LoRAs while preserving the performance on previously supported tasks. In this paper, we propose a data-free and computationally efficient strategy for selecting and merging LoRAs when a new one becomes available, assuming the device can store only a limited number of adapters. Extensive experiments across real-world tasks demonstrate the superiority of our approach compared to alternative strategies while adhering to the storage budget and compute limitations of on-device settings.
comment: 15 pages, 8 figures
☆ Data-driven learning of feedback maps for explicit robust predictive control: an approximation theoretic view
We establish an algorithm to learn feedback maps from data for a class of robust model predictive control (MPC) problems. The algorithm accounts for the approximation errors due to the learning directly at the synthesis stage, ensuring recursive feasibility by construction. The optimal control problem consists of a linear noisy dynamical system, a quadratic stage and quadratic terminal costs as the objective, and convex constraints on the state, control, and disturbance sequences; the control minimizes and the disturbance maximizes the objective. We proceed via two steps -- (a) Data generation: First, we reformulate the given minmax problem into a convex semi-infinite program and employ recently developed tools to solve it in an exact fashion on grid points of the state space to generate (state, action) data. (b) Learning approximate feedback maps: We employ a couple of approximation schemes that furnish tight approximations within preassigned uniform error bounds on the admissible state space to learn the unknown feedback policy. The stability of the closed-loop system under the approximate feedback policies is also guaranteed under a standard set of hypotheses. Two benchmark numerical examples are provided to illustrate the results.
comment: 27 pages; submitted
☆ Narrow Operator Models of Stellarator Equilibria in Fourier Zernike Basis
Numerical computation of the ideal Magnetohydrodynamic (MHD) equilibrium magnetic field is at the base of stellarator optimisation and provides the starting point for solving more sophisticated Partial Differential Equations (PDEs) like transport or turbulence models. Conventional approaches solve for a single stationary point of the ideal MHD equations, which is fully defined by three invariants and the numerical scheme employed by the solver. We present the first numerical approach that can solve for a continuous distribution of equilibria with fixed boundary and rotational transform, varying only the pressure invariant. This approach minimises the force residual by optimising parameters of multilayer perceptrons (MLP) that map from a scalar pressure multiplier to the Fourier Zernike basis as implemented in the modern stellarator equilibrium solver DESC.
comment: 15 pages, 6 figures, 1 table
☆ Offline and Online KL-Regularized RLHF under Differential Privacy
In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $\epsilon$ local differential privacy ($\epsilon$-LDP) model on the label of the human preference. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^\epsilon-1)^2 n])$ on the KL-regularized objective under single-policy concentrability. We also prove its optimality by providing a matching lower bound where $n$ is the sample size. In the online setting, we are the first one to theoretically investigate the problem of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T) /(e^\epsilon-1)^2 )$, where $T$ is the total time step, $N_{\mathcal{F}}$ is cardinality of the reward function space $\mathcal{F}$ and $d_{\mathcal{F}}$ is a variant of eluder dimension for RLHF. As a by-product of our analysis, our results also imply the first analysis for online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
☆ DistilCLIP-EEG: Enhancing Epileptic Seizure Detection Through Multi-modal Learning and Knowledge Distillation
Epilepsy is a prevalent neurological disorder marked by sudden, brief episodes of excessive neuronal activity caused by abnormal electrical discharges, which may lead to some mental disorders. Most existing deep learning methods for epilepsy detection rely solely on unimodal EEG signals, neglecting the potential benefits of multimodal information. To address this, we propose a novel multimodal model, DistilCLIP-EEG, based on the CLIP framework, which integrates both EEG signals and text descriptions to capture comprehensive features of epileptic seizures. The model involves an EEG encoder based on the Conformer architecture as a text encoder, the proposed Learnable BERT (BERT-LP) as prompt learning within the encoders. Both operate in a shared latent space for effective cross-modal representation learning. To enhance efficiency and adaptability, we introduce a knowledge distillation method where the trained DistilCLIP-EEG serves as a teacher to guide a more compact student model to reduce training complexity and time. On the TUSZ, AUBMC, and CHB-MIT datasets, both the teacher and student models achieved accuracy rates exceeding 97%. Across all datasets, the F1-scores were consistently above 0.94, demonstrating the robustness and reliability of the proposed framework. Moreover, the student model's parameter count and model size are approximately 58.1% of those of the teacher model, significantly reducing model complexity and storage requirements while maintaining high performance. These results highlight the potential of our proposed model for EEG-based epilepsy detection and establish a solid foundation for deploying lightweight models in resource-constrained settings.
comment: 16 pages, 9 figures, 5 tables
☆ ExpressNet-MoE: A Hybrid Deep Neural Network for Emotion Recognition
In many domains, including online education, healthcare, security, and human-computer interaction, facial emotion recognition (FER) is essential. Real-world FER is still difficult despite its significance because of some factors such as variable head positions, occlusions, illumination shifts, and demographic diversity. Engagement detection, which is essential for applications like virtual learning and customer services, is frequently challenging due to FER limitations by many current models. In this article, we propose ExpressNet-MoE, a novel hybrid deep learning model that blends both Convolution Neural Networks (CNNs) and Mixture of Experts (MoE) framework, to overcome the difficulties. Our model dynamically chooses the most pertinent expert networks, thus it aids in the generalization and providing flexibility to model across a wide variety of datasets. Our model improves on the accuracy of emotion recognition by utilizing multi-scale feature extraction to collect both global and local facial features. ExpressNet-MoE includes numerous CNN-based feature extractors, a MoE module for adaptive feature selection, and finally a residual network backbone for deep feature learning. To demonstrate efficacy of our proposed model we evaluated on several datasets, and compared with current state-of-the-art methods. Our model achieves accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013. The results show how adaptive our model is and how it may be used to develop end-to-end emotion recognition systems in practical settings. Reproducible codes and results are made publicly accessible at https://github.com/DeeptimaanB/ExpressNet-MoE.
comment: * Current version of the manuscript contains 17 pages including text, 13 figures, and 4 tables. The manuscript is currently under review at a journal
☆ Tahakom LLM guidelines and receipts: from pre-training data to an Arabic LLM
Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.
☆ Towards Blackwell Optimality: Bellman Optimality Is All You Can Get
Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.
☆ Near-Infrared Hyperspectral Imaging Applications in Food Analysis -- Improving Algorithms and Methodologies
This thesis investigates the application of near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis. The investigation is conducted through four studies operating with five research hypotheses. For several analyses, the studies compare models based on convolutional neural networks (CNNs) and partial least squares (PLS). Generally, joint spatio-spectral analysis with CNNs outperforms spatial analysis with CNNs and spectral analysis with PLS when modeling parameters where chemical and physical visual information are relevant. When modeling chemical parameters with a 2-dimensional (2D) CNN, augmenting the CNN with an initial layer dedicated to performing spectral convolution enhances its predictive performance by learning a spectral preprocessing similar to that applied by domain experts. Still, PLS-based spectral modeling performs equally well for analysis of the mean content of chemical parameters in samples and is the recommended approach. Modeling the spatial distribution of chemical parameters with NIR-HSI is limited by the ability to obtain spatially resolved reference values. Therefore, a study used bulk mean references for chemical map generation of fat content in pork bellies. A PLS-based approach gave non-smooth chemical maps and pixel-wise predictions outside the range of 0-100\%. Conversely, a 2D CNN augmented with a spectral convolution layer mitigated all issues arising with PLS. The final study attempted to model barley's germinative capacity by analyzing NIR spectra, RGB images, and NIR-HSI images. However, the results were inconclusive due to the dataset's low degree of germination. Additionally, this thesis has led to the development of two open-sourced Python packages. The first facilitates fast PLS-based modeling, while the second facilitates very fast cross-validation of PLS and other classical machine learning models with a new algorithm.
comment: PhD thesis
☆ $L_2$-Regularized Empirical Risk Minimization Guarantees Small Smooth Calibration Error
Calibration of predicted probabilities is critical for reliable machine learning, yet it is poorly understood how standard training procedures yield well-calibrated models. This work provides the first theoretical proof that canonical $L_{2}$-regularized empirical risk minimization directly controls the smooth calibration error (smCE) without post-hoc correction or specialized calibration-promoting regularizer. We establish finite-sample generalization bounds for smCE based on optimization error, regularization strength, and the Rademacher complexity. We then instantiate this theory for models in reproducing kernel Hilbert spaces, deriving concrete guarantees for kernel ridge and logistic regression. Our experiments confirm these specific guarantees, demonstrating that $L_{2}$-regularized ERM can provide a well-calibrated model without boosting or post-hoc recalibration. The source code to reproduce all experiments is available at https://github.com/msfuji0211/erm_calibration.
comment: 26 pages, 8 figures
☆ Robust Minimax Boosting with Performance Guarantees
Boosting methods often achieve excellent classification accuracy, but can experience notable performance degradation in the presence of label noise. Existing robust methods for boosting provide theoretical robustness guarantees for certain types of label noise, and can exhibit only moderate performance degradation. However, previous theoretical results do not account for realistic types of noise and finite training sizes, and existing robust methods can provide unsatisfactory accuracies, even without noise. This paper presents methods for robust minimax boosting (RMBoost) that minimize worst-case error probabilities and are robust to general types of label noise. In addition, we provide finite-sample performance guarantees for RMBoost with respect to the error obtained without noise and with respect to the best possible error (Bayes risk). The experimental results corroborate that RMBoost is not only resilient to label noise but can also provide strong classification accuracy.
☆ Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers
Certifying nonnegativity of polynomials is a well-known NP-hard problem with direct applications spanning non-convex optimization, control, robotics, and beyond. A sufficient condition for nonnegativity is the Sum of Squares (SOS) property, i.e., it can be written as a sum of squares of other polynomials. In practice, however, certifying the SOS criterion remains computationally expensive and often involves solving a Semidefinite Program (SDP), whose dimensionality grows quadratically in the size of the monomial basis of the SOS expression; hence, various methods to reduce the size of the monomial basis have been proposed. In this work, we introduce the first learning-augmented algorithm to certify the SOS criterion. To this end, we train a Transformer model that predicts an almost-minimal monomial basis for a given polynomial, thereby drastically reducing the size of the corresponding SDP. Our overall methodology comprises three key components: efficient training dataset generation of over 100 million SOS polynomials, design and training of the corresponding Transformer architecture, and a systematic fallback mechanism to ensure correct termination, which we analyze theoretically. We validate our approach on over 200 benchmark datasets, achieving speedups of over $100\times$ compared to state-of-the-art solvers and enabling the solution of instances where competing approaches fail. Our findings provide novel insights towards transforming the practical scalability of SOS programming.
☆ Steerable Conditional Diffusion for Domain Adaptation in PET Image Reconstruction
Diffusion models have recently enabled state-of-the-art reconstruction of positron emission tomography (PET) images while requiring only image training data. However, domain shift remains a key concern for clinical adoption: priors trained on images from one anatomy, acquisition protocol or pathology may produce artefacts on out-of-distribution data. We propose integrating steerable conditional diffusion (SCD) with our previously-introduced likelihood-scheduled diffusion (PET-LiSch) framework to improve the alignment of the diffusion model's prior to the target subject. At reconstruction time, for each diffusion step, we use low-rank adaptation (LoRA) to align the diffusion model prior with the target domain on the fly. Experiments on realistic synthetic 2D brain phantoms demonstrate that our approach suppresses hallucinated artefacts under domain shift, i.e. when our diffusion model is trained on perturbed images and tested on normal anatomy, our approach suppresses the hallucinated structure, outperforming both OSEM and diffusion model baselines qualitatively and quantitatively. These results provide a proof-of-concept that steerable priors can mitigate domain shift in diffusion-based PET reconstruction and motivate future evaluation on real data.
comment: Accepted for oral presentation at IEEE NSS MIC RTSD 2025 (submitted May 2025; accepted July 2025; to be presented Nov 2025)
☆ Rectify and Align GPS Points to Parking Spots via Rank-1 Constraint
Parking spots are essential components, providing vital mobile resources for residents in a city. Accurate Global Positioning System (GPS) points of parking spots are the core data for subsequent applications,e.g., parking management, parking policy, and urban development. However, high-rise buildings tend to cause GPS points to drift from the actual locations of parking spots; besides, the standard lower-cost GPS equipment itself has a certain location error. Therefore, it is a non-trivial task to correct a few wrong GPS points from a large number of parking spots in an unsupervised approach. In this paper, motivated by the physical constraints of parking spots (i.e., parking spots are parallel to the sides of roads), we propose an unsupervised low-rank method to effectively rectify errors in GPS points and further align them to the parking spots in a unified framework. The proposed unconventional rectification and alignment method is simple and yet effective for any type of GPS point errors. Extensive experiments demonstrate the superiority of the proposed method to solve a practical problem. The data set and the code are publicly accessible at:https://github.com/pangjunbiao/ITS-Parking-spots-Dataset.
☆ Hybrid Interval Type-2 Mamdani-TSK Fuzzy System for Regression Analysis
Regression analysis is employed to examine and quantify the relationships between input variables and a dependent and continuous output variable. It is widely used for predictive modelling in fields such as finance, healthcare, and engineering. However, traditional methods often struggle with real-world data complexities, including uncertainty and ambiguity. While deep learning approaches excel at capturing complex non-linear relationships, they lack interpretability and risk over-fitting on small datasets. Fuzzy systems provide an alternative framework for handling uncertainty and imprecision, with Mamdani and Takagi-Sugeno-Kang (TSK) systems offering complementary strengths: interpretability versus accuracy. This paper presents a novel fuzzy regression method that combines the interpretability of Mamdani systems with the precision of TSK models. The proposed approach introduces a hybrid rule structure with fuzzy and crisp components and dual dominance types, enhancing both accuracy and explainability. Evaluations on benchmark datasets demonstrate state-of-the-art performance in several cases, with rules maintaining a component similar to traditional Mamdani systems while improving precision through improved rule outputs. This hybrid methodology offers a balanced and versatile tool for predictive modelling, addressing the trade-off between interpretability and accuracy inherent in fuzzy systems. In the 6 datasets tested, the proposed approach gave the best fuzzy methodology score in 4 datasets, out-performed the opaque models in 2 datasets and produced the best overall score in 1 dataset with the improvements in RMSE ranging from 0.4% to 19%.
☆ Near-Optimality of Contrastive Divergence Algorithms
We perform a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1 / 3})$ rate to the true parameter of the data distribution, we show, under some regularity assumptions, that CD can achieve the parametric rate $O(n^{-1 / 2})$. Our analysis provides results for various data batching schemes, including the fully online and minibatch ones. We additionally show that CD can be near-optimal, in the sense that its asymptotic variance is close to the Cram\'er-Rao lower bound.
comment: 54 pages
☆ Modeling Adoptive Cell Therapy in Bladder Cancer from Sparse Biological Data using PINNs
Physics-informed neural networks (PINNs) are neural networks that embed the laws of dynamical systems modeled by differential equations into their loss function as constraints. In this work, we present a PINN framework applied to oncology. Here, we seek to learn time-varying interactions due to a combination therapy in a tumor microenvironment. In oncology, experimental data are often sparse and composed of a few time points of tumor volume. By embedding inductive biases derived from prior information about a dynamical system, we extend the physics-informed neural networks (PINN) and incorporate observed biological constraints as regularization agents. The modified PINN algorithm is able to steer itself to a reasonable solution and can generalize well with only a few training examples. We demonstrate the merit of our approach by learning the dynamics of treatment applied intermittently in an ordinary differential equation (ODE) model of a combination therapy. The algorithm yields a solution to the ODE and time-varying forms of some of the ODE model parameters. We demonstrate a strong convergence using metrics such as the mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
☆ When Embedding Models Meet: Procrustes Bounds and Applications
Embedding models trained separately on similar data often produce representations that encode stable information but are not directly interchangeable. This lack of interoperability raises challenges in several practical applications, such as model retraining, partial model upgrades, and multimodal search. Driven by these challenges, we study when two sets of embeddings can be aligned by an orthogonal transformation. We show that if pairwise dot products are approximately preserved, then there exists an isometry that closely aligns the two sets, and we provide a tight bound on the alignment error. This insight yields a simple alignment recipe, Procrustes post-processing, that makes two embedding models interoperable while preserving the geometry of each embedding space. Empirically, we demonstrate its effectiveness in three applications: maintaining compatibility across retrainings, combining different models for text retrieval, and improving mixed-modality search, where it achieves state-of-the-art performance.
☆ Optimizing Storage Overhead of User Behavior Log for ML-embedded Mobile Apps
Machine learning (ML) models are increasingly integrated into modern mobile apps to enable personalized and intelligent services. These models typically rely on rich input features derived from historical user behaviors to capture user intents. However, as ML-driven services become more prevalent, recording necessary user behavior data imposes substantial storage cost on mobile apps, leading to lower system responsiveness and more app uninstalls. To address this storage bottleneck, we present AdaLog, a lightweight and adaptive system designed to improve the storage efficiency of user behavior log in ML-embedded mobile apps, without compromising model inference accuracy or latency. We identify two key inefficiencies in current industrial practices of user behavior log: (i) redundant logging of overlapping behavior data across different features and models, and (ii) sparse storage caused by storing behaviors with heterogeneous attribute descriptions in a single log file. To solve these issues, AdaLog first formulates the elimination of feature-level redundant data as a maximum weighted matching problem in hypergraphs, and proposes a hierarchical algorithm for efficient on-device deployment. Then, AdaLog employs a virtually hashed attribute design to distribute heterogeneous behaviors into a few log files with physically dense storage. Finally, to ensure scalability to dynamic user behavior patterns, AdaLog designs an incremental update mechanism to minimize the I/O operations needed for adapting outdated behavior log. We implement a prototype of AdaLog and deploy it into popular mobile apps in collaboration with our industry partner. Evaluations on real-world user data show that AdaLog reduces behavior log size by 19% to 44% with minimal system overhead (only 2 seconds latency and 15 MB memory usage), providing a more efficient data foundation for broader adoption of on-device ML.
☆ SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with {Thermal} IR {(LWIR/MWIR)} and RGB
Enhancing scene understanding in adverse visibility conditions remains a critical challenge for surveillance and autonomous navigation systems. Conventional imaging modalities, such as RGB and thermal infrared (MWIR / LWIR), when fused, often struggle to deliver comprehensive scene information, particularly under conditions of atmospheric interference or inadequate illumination. To address these limitations, Short-Wave Infrared (SWIR) imaging has emerged as a promising modality due to its ability to penetrate atmospheric disturbances and differentiate materials with improved clarity. However, the advancement and widespread implementation of SWIR-based systems face significant hurdles, primarily due to the scarcity of publicly accessible SWIR datasets. In response to this challenge, our research introduces an approach to synthetically generate SWIR-like structural/contrast cues (without claiming spectral reproduction) images from existing LWIR data using advanced contrast enhancement techniques. We then propose a multimodal fusion framework integrating synthetic SWIR, LWIR, and RGB modalities, employing an optimized encoder-decoder neural network architecture with modality-specific encoders and a softmax-gated fusion head. Comprehensive experiments on public {RGB-LWIR benchmarks (M3FD, TNO, CAMEL, MSRS, RoadScene) and an additional private real RGB-MWIR-SWIR dataset} demonstrate that our synthetic-SWIR-enhanced fusion framework improves fused-image quality (contrast, edge definition, structural fidelity) while maintaining real-time performance. We also add fair trimodal baselines (LP, LatLRR, GFF) and cascaded trimodal variants of U2Fusion/SwinFusion under a unified protocol. The outcomes highlight substantial potential for real-world applications in surveillance and autonomous systems.
☆ F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs
Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-totext translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as llama.cpp, which support optimizations such as KV-caching and quantization, it is now easier than ever to deploy LLMs on edge devices. Quantization is fundamental to enable LLMs on resource-constrained edge devices, and llama.cpp utilizes block floating point (BFP) quantization to drastically reduce the bit width of weights and input tensors, the memory footprint, and the computational power required to run LLMs. LLMs are typically quantized with mixed BFP quantization across the model layers to reduce the loss of model accuracy due to quantization. Therefore, to efficiently accelerate across the layers of BFP-quantized LLMs, specialized accelerators need to support different BFP variants without reconfiguration. To address this issue, we propose a Flexible Block FloatingPoint Quantization (F-BFQ) accelerator, which can dynamically switch between two BFP quantization variants and perform matrix multiplication (MatMul) operations. Our initial F-BFQ accelerator design, deployed on the AMD Kria board, reduces inference time by 1.4x on average over the Arm NEON-based CPU execution across three BFP quantized LLMs while achieving 5.2 tokens per second (~3.9 words per second).
comment: Accepted to Workshop on New Approaches for Addressing the Computing Requirements of LLMs and GNNs (LG-ARC) @ ISCA 2025
☆ Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring
Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, because of which treatment effect estimates are also biased. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimation, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further develop a novel meta-learner that estimates the bounds using arbitrary machine learning models and with favorable theoretical properties, including double robustness and quasi-oracle efficiency. We demonstrate the practical value of our meta-learner through numerical experiments and in an application to a cancer drug trial. Together, our framework offers a practical tool for assessing the robustness of estimated treatment effects in the presence of censoring and thus promotes the reliable use of survival data for evidence generation in medicine and epidemiology.
☆ Going with the Flow: Approximating Banzhaf Values via Graph Neural Networks
Computing the Banzhaf value in network flow games is fundamental for quantifying agent influence in multi-agent systems, with applications ranging from cybersecurity to infrastructure planning. However, exact computation is intractable for systems with more than $\sim20$ agents due to exponential complexity $\mathcal{O}(2^m)$. While Monte Carlo sampling methods provide statistical estimates, they suffer from high sample complexity and cannot transfer knowledge across different network configurations, making them impractical for large-scale or dynamic systems. We present a novel learning-based approach using Graph Neural Networks (GNNs) to approximate Banzhaf values in cardinal network flow games. By framing the problem as a graph-level prediction task, our method learns generalisable patterns of agent influence directly from network topology and control structure. We conduct a comprehensive empirical study comparing three state-of-the-art GNN architectures-Graph Attention Networks (GAT), Graph Isomorphism Networks with Edge features (GINE), and EdgeConv-on a large-scale synthetic dataset of 200,000 graphs per configuration, varying in size (20-100 nodes), agent count (5-20), and edge probability (0.5-1.0). Our results demonstrate that trained GNN models achieve high-fidelity Banzhaf value approximation with order-of-magnitude speedups compared to exact and sampling-based methods. Most significantly, we show strong zero-shot generalisation: models trained on graphs of a specific size and topology accurately predict Banzhaf values for entirely new networks with different structural properties, without requiring retraining. This work establishes GNNs as a practical tool for scalable cooperative game-theoretic analysis of complex networked systems.
comment: 21 pages, 8 figures, 11-page appendix
☆ Prediction Markets with Intermittent Contributions
Although both data availability and the demand for accurate forecasts are increasing, collaboration between stakeholders is often constrained by data ownership and competitive interests. In contrast to recent proposals within cooperative game-theoretical frameworks, we place ourselves in a more general framework, based on prediction markets. There, independent agents trade forecasts of uncertain future events in exchange for rewards. We introduce and analyse a prediction market that (i) accounts for the historical performance of the agents, (ii) adapts to time-varying conditions, while (iii) permitting agents to enter and exit the market at will. The proposed design employs robust regression models to learn the optimal forecasts' combination whilst handling missing submissions. Moreover, we introduce a pay-off allocation mechanism that considers both in-sample and out-of-sample performance while satisfying several desirable economic properties. Case-studies using simulated and real-world data allow demonstrating the effectiveness and adaptability of the proposed market design.
comment: Submitted to PSCC 2026
☆ Contrastive Learning-Based Dependency Modeling for Anomaly Detection in Cloud Services
This paper addresses the challenges of complex dependencies and diverse anomaly patterns in cloud service environments by proposing a dependency modeling and anomaly detection method that integrates contrastive learning. The method abstracts service interactions into a dependency graph, extracts temporal and structural features through embedding functions, and employs a graph convolution mechanism to aggregate neighborhood information for context-aware service representations. A contrastive learning framework is then introduced, constructing positive and negative sample pairs to enhance the separability of normal and abnormal patterns in the representation space. Furthermore, a temporal consistency constraint is designed to maintain representation stability across time steps and reduce the impact of short-term fluctuations and noise. The overall optimization combines contrastive loss and temporal consistency loss to ensure stable and reliable detection across multi-dimensional features. Experiments on public datasets systematically evaluate the method from hyperparameter, environmental, and data sensitivity perspectives. Results show that the proposed approach significantly outperforms existing methods on key metrics such as Precision, Recall, F1-Score, and AUC, while maintaining robustness under conditions of sparse labeling, monitoring noise, and traffic fluctuations. This study verifies the effectiveness of integrating dependency modeling with contrastive learning, provides a complete technical solution for cloud service anomaly detection, and demonstrates strong adaptability and stability in complex environments.
☆ A New Perspective on Transformers in Online Reinforcement Learning for Continuous Control
Despite their effectiveness and popularity in offline or model-based reinforcement learning (RL), transformers remain underexplored in online model-free RL due to their sensitivity to training setups and model design decisions such as how to structure the policy and value networks, share components, or handle temporal information. In this paper, we show that transformers can be strong baselines for continuous control in online model-free RL. We investigate key design questions: how to condition inputs, share components between actor and critic, and slice sequential data for training. Our experiments reveal stable architectural and training strategies enabling competitive performance across fully and partially observable tasks, and in both vector- and image-based settings. These findings offer practical guidance for applying transformers in online RL.
☆ Generalist++: A Meta-learning Framework for Mitigating Trade-off in Adversarial Training
Despite the rapid progress of neural networks, they remain highly vulnerable to adversarial examples, for which adversarial training (AT) is currently the most effective defense. While AT has been extensively studied, its practical applications expose two major limitations: natural accuracy tends to degrade significantly compared with standard training, and robustness does not transfer well across attacks crafted under different norm constraints. Unlike prior works that attempt to address only one issue within a single network, we propose to partition the overall generalization goal into multiple sub-tasks, each assigned to a dedicated base learner. By specializing in its designated objective, each base learner quickly becomes an expert in its field. In the later stages of training, we interpolate their parameters to form a knowledgeable global learner, while periodically redistributing the global parameters back to the base learners to prevent their optimization trajectories from drifting too far from the shared target. We term this framework Generalist and introduce three variants tailored to different application scenarios. Both theoretical analysis and extensive experiments demonstrate that Generalist achieves lower generalization error and significantly alleviates the trade-off problems compared with baseline methods. Our results suggest that Generalist provides a promising step toward developing fully robust classifiers in the future.
☆ Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM) -- which has demonstrated strong performance in image recognition and image-text retrieval tasks -- to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.
comment: Accepted to ACM RecSys 2025 (Spotlight)
Kernel Representation and Similarity Measure for Incomplete Data
Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents the proximity kernel, a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space. The proposed method introduces data-dependent binning combined with proximity assignment to project data into a high-dimensional sparse representation that adapts to local density variations. For missing value handling, we propose a cascading fallback strategy to estimate missing feature distributions. We conduct clustering tasks on the proposed kernel representation across 12 real world incomplete datasets, demonstrating superior performance compared to existing methods while maintaining linear time complexity. All the code are available at https://anonymous.4open.science/r/proximity-kernel-2289.
☆ AOAD-MAT: Transformer-based multi-agent deep reinforcement learning model considering agents' order of action decisions
Multi-agent reinforcement learning focuses on training the behaviors of multiple learning agents that coexist in a shared environment. Recently, MARL models, such as the Multi-Agent Transformer (MAT) and ACtion dEpendent deep Q-learning (ACE), have significantly improved performance by leveraging sequential decision-making processes. Although these models can enhance performance, they do not explicitly consider the importance of the order in which agents make decisions. In this paper, we propose an Agent Order of Action Decisions-MAT (AOAD-MAT), a novel MAT model that considers the order in which agents make decisions. The proposed model explicitly incorporates the sequence of action decisions into the learning process, allowing the model to learn and predict the optimal order of agent actions. The AOAD-MAT model leverages a Transformer-based actor-critic architecture that dynamically adjusts the sequence of agent actions. To achieve this, we introduce a novel MARL architecture that cooperates with a subtask focused on predicting the next agent to act, integrated into a Proximal Policy Optimization based loss function to synergistically maximize the advantage of the sequential decision-making. The proposed method was validated through extensive experiments on the StarCraft Multi-Agent Challenge and Multi-Agent MuJoCo benchmarks. The experimental results show that the proposed AOAD-MAT model outperforms existing MAT and other baseline models, demonstrating the effectiveness of adjusting the AOAD order in MARL.
comment: This manuscript is an extended version of the work accepted as a short paper at the 26th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2025). The Version of Record of this contribution is published in Springer's Lecture Notes in Artificial Intelligence series (LNCS/LNAI)
☆ Thompson Sampling via Fine-Tuning of LLMs
Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT) leverages the prior knowledge embedded in prompt-conditioned large language models, and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality--a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. We demonstrate that online fine-tuning significantly improves sample efficiency, with negligible impact on computational efficiency.
☆ When In Doubt, Abstain: The Impact of Abstention on Strategic Classification
Algorithmic decision making is increasingly prevalent, but often vulnerable to strategic manipulation by agents seeking a favorable outcome. Prior research has shown that classifier abstention (allowing a classifier to decline making a decision due to insufficient confidence) can significantly increase classifier accuracy. This paper studies abstention within a strategic classification context, exploring how its introduction impacts strategic agents' responses and how principals should optimally leverage it. We model this interaction as a Stackelberg game where a principal, acting as the classifier, first announces its decision policy, and then strategic agents, acting as followers, manipulate their features to receive a desired outcome. Here, we focus on binary classifiers where agents manipulate observable features rather than their true features, and show that optimal abstention ensures that the principal's utility (or loss) is no worse than in a non-abstention setting, even in the presence of strategic agents. We also show that beyond improving accuracy, abstention can also serve as a deterrent to manipulation, making it costlier for agents, especially those less qualified, to manipulate to achieve a positive outcome when manipulation costs are significant enough to affect agent behavior. These results highlight abstention as a valuable tool for reducing the negative effects of strategic behavior in algorithmic decision making systems.
☆ RockNet: Distributed Learning on Ultra-Low-Power Devices
As Machine Learning (ML) becomes integral to Cyber-Physical Systems (CPS), there is growing interest in shifting training from traditional cloud-based to on-device processing (TinyML), for example, due to privacy and latency concerns. However, CPS often comprise ultra-low-power microcontrollers, whose limited compute resources make training challenging. This paper presents RockNet, a new TinyML method tailored for ultra-low-power hardware that achieves state-of-the-art accuracy in timeseries classification, such as fault or malware detection, without requiring offline pretraining. By leveraging that CPS consist of multiple devices, we design a distributed learning method that integrates ML and wireless communication. RockNet leverages all devices for distributed training of specialized compute efficient classifiers that need minimal communication overhead for parallelization. Combined with tailored and efficient wireless multi-hop communication protocols, our approach overcomes the communication bottleneck that often occurs in distributed learning. Hardware experiments on a testbed with 20 ultra-low-power devices demonstrate RockNet's effectiveness. It successfully learns timeseries classification tasks from scratch, surpassing the accuracy of the latest approach for neural network microcontroller training by up to 2x. RockNet's distributed ML architecture reduces memory, latency and energy consumption per device by up to 90 % when scaling from one central device to 20 devices. Our results show that a tight integration of distributed ML, distributed computing, and communication enables, for the first time, training on ultra-low-power hardware with state-of-the-art accuracy.
☆ Isolation-based Spherical Ensemble Representations for Anomaly Detection
Anomaly detection is a critical task in data mining and management with applications spanning fraud detection, network security, and log monitoring. Despite extensive research, existing unsupervised anomaly detection methods still face fundamental challenges including conflicting distributional assumptions, computational inefficiency, and difficulty handling different anomaly types. To address these problems, we propose ISER (Isolation-based Spherical Ensemble Representations) that extends existing isolation-based methods by using hypersphere radii as proxies for local density characteristics while maintaining linear time and constant space complexity. ISER constructs ensemble representations where hypersphere radii encode density information: smaller radii indicate dense regions while larger radii correspond to sparse areas. We introduce a novel similarity-based scoring method that measures pattern consistency by comparing ensemble representations against a theoretical anomaly reference pattern. Additionally, we enhance the performance of Isolation Forest by using ISER and adapting the scoring function to address axis-parallel bias and local anomaly detection limitations. Comprehensive experiments on 22 real-world datasets demonstrate ISER's superior performance over 11 baseline methods.
☆ Km-scale dynamical downscaling through conformalized latent diffusion models
Dynamical downscaling is crucial for deriving high-resolution meteorological fields from coarse-scale simulations, enabling detailed analysis for critical applications such as weather forecasting and renewable energy modeling. Generative Diffusion models (DMs) have recently emerged as powerful data-driven tools for this task, offering reconstruction fidelity and more scalable sampling supporting uncertainty quantification. However, DMs lack finite-sample guarantees against overconfident predictions, resulting in miscalibrated grid-point-level uncertainty estimates hindering their reliability in operational contexts. In this work, we tackle this issue by augmenting the downscaling pipeline with a conformal prediction framework. Specifically, the DM's samples are post-processed to derive conditional quantile estimates, incorporated into a conformalized quantile regression procedure targeting locally adaptive prediction intervals with finite-sample marginal validity. The proposed approach is evaluated on ERA5 reanalysis data over Italy, downscaled to a 2-km grid. Results demonstrate grid-point-level uncertainty estimates with markedly improved coverage and stable probabilistic scores relative to the DM baseline, highlighting the potential of conformalized generative models for more trustworthy probabilistic downscaling to high-resolution meteorological fields.
comment: 7 pages
☆ Federated Conditional Conformal Prediction via Generative Models
Conformal Prediction (CP) provides distribution-free uncertainty quantification by constructing prediction sets that guarantee coverage of the true labels. This reliability makes CP valuable for high-stakes federated learning scenarios such as multi-center healthcare. However, standard CP assumes i.i.d. data, which is violated in federated settings where client distributions differ substantially. Existing federated CP methods address this by maintaining marginal coverage on each client, but such guarantees often fail to reflect input-conditional uncertainty. In this work, we propose Federated Conditional Conformal Prediction (Fed-CCP) via generative models, which aims for conditional coverage that adapts to local data heterogeneity. Fed-CCP leverages generative models, such as normalizing flows or diffusion models, to approximate conditional data distributions without requiring the sharing of raw data. This enables each client to locally calibrate conformal scores that reflect its unique uncertainty, while preserving global consistency through federated aggregation. Experiments on real datasets demonstrate that Fed-CCP achieves more adaptive prediction sets.
☆ To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models ICML 2025
We introduce Mechanistic Error Reduction with Abstention (MERA), a principled framework for steering language models (LMs) to mitigate errors through selective, adaptive interventions. Unlike existing methods that rely on fixed, manually tuned steering strengths, often resulting in under or oversteering, MERA addresses these limitations by (i) optimising the intervention direction, and (ii) calibrating when, and how much to steer, thereby provably improving performance or abstaining when no confident correction is possible. Experiments across diverse datasets, and LM families demonstrate safe, effective, non-degrading error correction, and that MERA outperforms existing baselines. Moreover, MERA can be applied on top of existing steering techniques to further enhance their performance, establishing it as a general-purpose, and efficient approach to mechanistic activation steering.
comment: ICML 2025, 22 pages, 16 figures, 5 tables
☆ Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses
This paper introduces a new paradigm for generative error correction (GER) framework in audio-visual speech recognition (AVSR) that reasons over modality-specific evidences directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt offers the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for an accurate correction. Under various corruption scenarios, our framework attains up to 57.7% error rate gain on the LRS2 benchmark over standard ASR baseline, contrary to single-stream GER approaches that achieve only 10% gain. To facilitate research within our DualHyp framework, we release the code and the dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.
comment: Preprint work
☆ BlendFL: Blended Federated Learning for Handling Multimodal Data Heterogeneity
One of the key challenges of collaborative machine learning, without data sharing, is multimodal data heterogeneity in real-world settings. While Federated Learning (FL) enables model training across multiple clients, existing frameworks, such as horizontal and vertical FL, are only effective in `ideal' settings that meet specific assumptions. Hence, they struggle to address scenarios where neither all modalities nor all samples are represented across the participating clients. To address this gap, we propose BlendFL, a novel FL framework that seamlessly blends the principles of horizontal and vertical FL in a synchronized and non-restrictive fashion despite the asymmetry across clients. Specifically, any client within BlendFL can benefit from either of the approaches, or both simultaneously, according to its available dataset. In addition, BlendFL features a decentralized inference mechanism, empowering clients to run collaboratively trained local models using available local data, thereby reducing latency and reliance on central servers for inference. We also introduce BlendAvg, an adaptive global model aggregation strategy that prioritizes collaborative model updates based on each client's performance. We trained and evaluated BlendFL and other state-of-the-art baselines on three classification tasks using a large-scale real-world multimodal medical dataset and a popular multimodal benchmark. Our results highlight BlendFL's superior performance for both multimodal and unimodal classification. Ablation studies demonstrate BlendFL's faster convergence compared to traditional approaches, accelerating collaborative learning. Overall, in our study we highlight the potential of BlendFL for handling multimodal data heterogeneity for collaborative learning in real-world settings where data privacy is crucial, such as in healthcare and finance.
☆ Hypernetworks for Perspectivist Adaptation
The task of perspective-aware classification introduces a bottleneck in terms of parametric efficiency that did not get enough recognition in existing studies. In this article, we aim to address this issue by applying an existing architecture, the hypernetwork+adapters combination, to perspectivist classification. Ultimately, we arrive at a solution that can compete with specialized models in adopting user perspectives on hate speech and toxicity detection, while also making use of considerably fewer parameters. Our solution is architecture-agnostic and can be applied to a wide range of base models out of the box.
comment: Accepted at NLPerspectives workshop 2025
☆ Rethinking Graph Domain Adaptation: A Spectral Contrastive Perspective ECML
Graph neural networks (GNNs) have achieved remarkable success in various domains, yet they often struggle with domain adaptation due to significant structural distribution shifts and insufficient exploration of transferable patterns. One of the main reasons behind this is that traditional approaches do not treat global and local patterns discriminatingly so that some local details in the graph may be violated after multi-layer GNN. Our key insight is that domain shifts can be better understood through spectral analysis, where low-frequency components often encode domain-invariant global patterns, and high-frequency components capture domain-specific local details. As such, we propose FracNet (\underline{\textbf{Fr}}equency \underline{\textbf{A}}ware \underline{\textbf{C}}ontrastive Graph \underline{\textbf{Net}}work) with two synergic modules to decompose the original graph into high-frequency and low-frequency components and perform frequency-aware domain adaption. Moreover, the blurring boundary problem of domain adaptation is improved by integrating with a contrastive learning framework. Besides the practical implication, we also provide rigorous theoretical proof to demonstrate the superiority of FracNet. Extensive experiments further demonstrate significant improvements over state-of-the-art approaches.
comment: This paper is accepted by ECML-PKDD 2025
☆ Automated Network Protocol Testing with LLM Agents
Network protocol testing is fundamental for modern network infrastructure. However, traditional network protocol testing methods are labor-intensive and error-prone, requiring manual interpretation of specifications, test case design, and translation into executable artifacts, typically demanding one person-day of effort per test case. Existing model-based approaches provide partial automation but still involve substantial manual modeling and expert intervention, leading to high costs and limited adaptability to diverse and evolving protocols. In this paper, we propose a first-of-its-kind system called NeTestLLM that takes advantage of multi-agent Large Language Models (LLMs) for end-to-end automated network protocol testing. NeTestLLM employs hierarchical protocol understanding to capture complex specifications, iterative test case generation to improve coverage, a task-specific workflow for executable artifact generation, and runtime feedback analysis for debugging and refinement. NeTestLLM has been deployed in a production environment for several months, receiving positive feedback from domain experts. In experiments, NeTestLLM generated 4,632 test cases for OSPF, RIP, and BGP, covering 41 historical FRRouting bugs compared to 11 by current national standards. The process of generating executable artifacts also improves testing efficiency by a factor of 8.65x compared to manual methods. NeTestLLM provides the first practical LLM-powered solution for automated end-to-end testing of heterogeneous network protocols.
☆ Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models
Vision-Language-Action (VLA) models have achieved revolutionary progress in robot learning, enabling robots to execute complex physical robot tasks from natural language instructions. Despite this progress, their adversarial robustness remains underexplored. In this work, we propose both adversarial patch attack and corresponding defense strategies for VLA models. We first introduce the Embedding Disruption Patch Attack (EDPA), a model-agnostic adversarial attack that generates patches directly placeable within the camera's view. In comparison to prior methods, EDPA can be readily applied to different VLA models without requiring prior knowledge of the model architecture, or the controlled robotic manipulator. EDPA constructs these patches by (i) disrupting the semantic alignment between visual and textual latent representations, and (ii) maximizing the discrepancy of latent representations between adversarial and corresponding clean visual inputs. Through the optimization of these objectives, EDPA distorts the VLA's interpretation of visual information, causing the model to repeatedly generate incorrect actions and ultimately result in failure to complete the given robotic task. To counter this, we propose an adversarial fine-tuning scheme for the visual encoder, in which the encoder is optimized to produce similar latent representations for both clean and adversarially perturbed visual inputs. Extensive evaluations on the widely recognized LIBERO robotic simulation benchmark demonstrate that EDPA substantially increases the task failure rate of cutting-edge VLA models, while our proposed defense effectively mitigates this degradation. The codebase is accessible via the homepage at https://edpa-attack.github.io/.
☆ Altruistic Ride Sharing: A Community-Driven Approach to Short-Distance Mobility
Urban mobility faces persistent challenges of congestion and fuel consumption, specifically when people choose a private, point-to-point commute option. Profit-driven ride-sharing platforms prioritize revenue over fairness and sustainability. This paper introduces Altruistic Ride-Sharing (ARS), a decentralized, peer-to-peer mobility framework where participants alternate between driver and rider roles based on altruism points rather than monetary incentives. The system integrates multi-agent reinforcement learning (MADDPG) for dynamic ride-matching, game-theoretic equilibrium guarantees for fairness, and a population model to sustain long-term balance. Using real-world New York City taxi data, we demonstrate that ARS reduces travel distance and emissions, increases vehicle utilization, and promotes equitable participation compared to both no-sharing and optimization-based baselines. These results establish ARS as a scalable, community-driven alternative to conventional ride-sharing, aligning individual behavior with collective urban sustainability goals.
comment: Submitted to IEEE Transactions on Intelligent Transportation Systems
☆ Sample-Centric Multi-Task Learning for Detection and Segmentation of Industrial Surface Defects
Industrial surface defect inspection for sample-wise quality control (QC) must simultaneously decide whether a given sample contains defects and localize those defects spatially. In real production lines, extreme foreground-background imbalance, defect sparsity with a long-tailed scale distribution, and low contrast are common. As a result, pixel-centric training and evaluation are easily dominated by large homogeneous regions, making it difficult to drive models to attend to small or low-contrast defects-one of the main bottlenecks for deployment. Empirically, existing models achieve strong pixel-overlap metrics (e.g., mIoU) but exhibit insufficient stability at the sample level, especially for sparse or slender defects. The root cause is a mismatch between the optimization objective and the granularity of QC decisions. To address this, we propose a sample-centric multi-task learning framework and evaluation suite. Built on a shared-encoder architecture, the method jointly learns sample-level defect classification and pixel-level mask localization. Sample-level supervision modulates the feature distribution and, at the gradient level, continually boosts recall for small and low-contrast defects, while the segmentation branch preserves boundary and shape details to enhance per-sample decision stability and reduce misses. For evaluation, we propose decision-linked metrics, Seg_mIoU and Seg_Recall, which remove the bias of classical mIoU caused by empty or true-negative samples and tightly couple localization quality with sample-level decisions. Experiments on two benchmark datasets demonstrate that our approach substantially improves the reliability of sample-level decisions and the completeness of defect localization.
☆ LLM-guided Hierarchical Retrieval
Modern IR systems are increasingly tasked with answering complex, multi-faceted queries that require deep reasoning rather than simple keyword or semantic matching. While LLM-based IR has shown great promise, the prevailing retrieve-then-rerank paradigm inherits the limitations of embedding-based retrieval; parametric generative approaches are difficult to update with new information; and long-context methods that place the entire corpus in context are computationally infeasible for large document collections. To address these challenges, we introduce LATTICE, a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity by imposing a semantic tree structure on the corpus. Our approach consists of two stages: (1) an offline phase that organizes the corpus into a semantic hierarchy via either a bottom-up agglomerative strategy or a top-down divisive strategy using multi-level summaries and (2) an online traversal phase where a search LLM navigates this tree. A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy, making cross-branch and cross-level comparisons difficult. To overcome this, we propose a traversal algorithm that estimates calibrated latent relevance scores from local LLM outputs and aggregates them into a global path relevance metric. Our training-free framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark, demonstrating up to 9% improvement in Recall@100 and 5% in nDCG@10 over the next best zero-shot baseline. Furthermore, compared to the fine-tuned SOTA method DIVER-v2, LATTICE attains comparable results on BRIGHT subsets that use a static corpus for evaluation.
☆ Towards Understanding Valuable Preference Data for Large Language Model Alignment
Large language model (LLM) alignment is typically achieved through learning from human preference comparisons, making the quality of preference data critical to its success. Existing studies often pre-process raw training datasets to identify valuable preference pairs using external reward models or off-the-shelf LLMs, achieving improved overall performance but rarely examining whether individual, selected data point is genuinely beneficial. We assess data quality through individual influence on validation data using our newly proposed truncated influence function (TIF), which mitigates the over-scoring present in traditional measures and reveals that preference data quality is inherently a property of the model. In other words, a data pair that benefits one model may harm another. This leaves the need to improve the preference data selection approaches to be adapting to specific models. To this end, we introduce two candidate scoring functions (SFs) that are computationally simpler than TIF and positively correlated with it. They are also model dependent and can serve as potential indicators of individual data quality for preference data selection. Furthermore, we observe that these SFs inherently exhibit errors when compared to TIF. To this end, we combine them to offset their diverse error sources, resulting in a simple yet effective data selection rule that enables the models to achieve a more precise selection of valuable preference data. We conduct experiments across diverse alignment benchmarks and various LLM families, with results demonstrating that better alignment performance can be achieved using less data, showing the generality of our findings and new methods.
☆ Performance Evaluation of Ising and QUBO Variable Encodings in Boltzmann Machine Learning
We compare Ising ({-1,+1}) and QUBO ({0,1}) encodings for Boltzmann machine learning under a controlled protocol that fixes the model, sampler, and step size. Exploiting the identity that the Fisher information matrix (FIM) equals the covariance of sufficient statistics, we visualize empirical moments from model samples and reveal systematic, representation-dependent differences. QUBO induces larger cross terms between first- and second-order statistics, creating more small-eigenvalue directions in the FIM and lowering spectral entropy. This ill-conditioning explains slower convergence under stochastic gradient descent (SGD). In contrast, natural gradient descent (NGD)-which rescales updates by the FIM metric-achieves similar convergence across encodings due to reparameterization invariance. Practically, for SGD-based training, the Ising encoding provides more isotropic curvature and faster convergence; for QUBO, centering/scaling or NGD-style preconditioning mitigates curvature pathologies. These results clarify how representation shapes information geometry and finite-time learning dynamics in Boltzmann machines and yield actionable guidelines for variable encoding and preprocessing.
comment: 12pages, 6figures
☆ CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection
Healthcare fraud detection remains a critical challenge due to limited availability of labeled data, constantly evolving fraud tactics, and the high dimensionality of medical records. Traditional supervised methods are challenged by extreme label scarcity, while purely unsupervised approaches often fail to capture clinically meaningful anomalies. In this work, we introduce CleverCatch, a knowledge-guided weak supervision model designed to detect fraudulent prescription behaviors with improved accuracy and interpretability. Our approach integrates structured domain expertise into a neural architecture that aligns rules and data samples within a shared embedding space. By training encoders jointly on synthetic data representing both compliance and violation, CleverCatch learns soft rule embeddings that generalize to complex, real-world datasets. This hybrid design enables data-driven learning to be enhanced by domain-informed constraints, bridging the gap between expert heuristics and machine learning. Experiments on the large-scale real-world dataset demonstrate that CleverCatch outperforms four state-of-the-art anomaly detection baselines, yielding average improvements of 1.3\% in AUC and 3.4\% in recall. Our ablation study further highlights the complementary role of expert rules, confirming the adaptability of the framework. The results suggest that embedding expert rules into the learning process not only improves detection accuracy but also increases transparency, offering an interpretable approach for high-stakes domains such as healthcare fraud detection.
☆ Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences
The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.
☆ Information-Theoretic Criteria for Knowledge Distillation in Multimodal Learning
The rapid increase in multimodal data availability has sparked significant interest in cross-modal knowledge distillation (KD) techniques, where richer "teacher" modalities transfer information to weaker "student" modalities during model training to improve performance. However, despite successes across various applications, cross-modal KD does not always result in improved outcomes, primarily due to a limited theoretical understanding that could inform practice. To address this gap, we introduce the Cross-modal Complementarity Hypothesis (CCH): we propose that cross-modal KD is effective when the mutual information between teacher and student representations exceeds the mutual information between the student representation and the labels. We theoretically validate the CCH in a joint Gaussian model and further confirm it empirically across diverse multimodal datasets, including image, text, video, audio, and cancer-related omics data. Our study establishes a novel theoretical framework for understanding cross-modal KD and offers practical guidelines based on the CCH criterion to select optimal teacher modalities for improving the performance of weaker modalities.
☆ Universally Invariant Learning in Equivariant GNNs
Equivariant Graph Neural Networks (GNNs) have demonstrated significant success across various applications. To achieve completeness -- that is, the universal approximation property over the space of equivariant functions -- the network must effectively capture the intricate multi-body interactions among different nodes. Prior methods attain this via deeper architectures, augmented body orders, or increased degrees of steerable features, often at high computational cost and without polynomial-time solutions. In this work, we present a theoretically grounded framework for constructing complete equivariant GNNs that is both efficient and practical. We prove that a complete equivariant GNN can be achieved through two key components: 1) a complete scalar function, referred to as the canonical form of the geometric graph; and 2) a full-rank steerable basis set. Leveraging this finding, we propose an efficient algorithm for constructing complete equivariant GNNs based on two common models: EGNN and TFN. Empirical results demonstrate that our model demonstrates superior completeness and excellent performance with only a few layers, thereby significantly reducing computational overhead while maintaining strong practical efficacy.
☆ Behavioral Embeddings of Programs: A Quasi-Dynamic Approach for Optimization Prediction
Learning effective numerical representations, or embeddings, of programs is a fundamental prerequisite for applying machine learning to automate and enhance compiler optimization. Prevailing paradigms, however, present a dilemma. Static representations, derived from source code or intermediate representation (IR), are efficient and deterministic but offer limited insight into how a program will behave or evolve under complex code transformations. Conversely, dynamic representations, which rely on runtime profiling, provide profound insights into performance bottlenecks but are often impractical for large-scale tasks due to prohibitive overhead and inherent non-determinism. This paper transcends this trade-off by proposing a novel quasi-dynamic framework for program representation. The core insight is to model a program's optimization sensitivity. We introduce the Program Behavior Spectrum, a new representation generated by probing a program's IR with a diverse set of optimization sequences and quantifying the resulting changes in its static features. To effectively encode this high-dimensional, continuous spectrum, we pioneer a compositional learning approach. Product Quantization is employed to discretize the continuous reaction vectors into structured, compositional sub-words. Subsequently, a multi-task Transformer model, termed PQ-BERT, is pre-trained to learn the deep contextual grammar of these behavioral codes. Comprehensive experiments on two representative compiler optimization tasks -- Best Pass Prediction and -Oz Benefit Prediction -- demonstrate that our method outperforms state-of-the-art static baselines. Our code is publicly available at https://github.com/Panhaolin2001/PREP/.
☆ D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations
The computation and memory costs of large language models kept increasing over last decade, which reached over the scale of 1T parameters. To address the challenges from the large scale models, model compression techniques such as low-rank decomposition have been explored. Previous model decomposition works have focused on weight decomposition to avoid costly runtime decomposition, whose latency often significantly exceeds the benefits from decomposition (e.g., 38% more end-to-end latency when running Llama2-7b on A100 with 4K sequence length with activation decomposition compared to no decomposition). In this work, we debunk such observations and report that the input decomposition can be significantly beneficial with a proper choice of decomposition algorithm and hardware support. We adopt progressive decomposition algorithm, Lanczos algorithm, and design a co-accelerator architecture for the decomposition algorithm. To address the memory- boundness of the decomposition operation, we introduce a novel compute replication methodology that moves the op- eration toward compute-bound region, which enables 6.2x speedup in our evaluation. We also develop an output shape- preserving computation scheme that eliminates decomposi- tion costs in consecutive layers. To compensate model quality loss from compression, we introduce a multi-track decom- position approach that separately handles outlier channels for high accuracy and low perplexity with minimal compu- tational costs. Combined together, our accelerator, D-com, provides 22% end-to-end latency improvements compared to A100 GPU at the cost of small model quality degradation (e.g., 3% on AI2 Reasoning Challenge task).
comment: 12 pages, 13 figures
☆ Convergence, design and training of continuous-time dropout as a random batch method
We study dropout regularization in continuous-time models through the lens of random-batch methods -- a family of stochastic sampling schemes originally devised to reduce the computational cost of interacting particle systems. We construct an unbiased, well-posed estimator that mimics dropout by sampling neuron batches over time intervals of length $h$. Trajectory-wise convergence is established with linear rate in $h$ for the expected uniform error. At the distribution level, we establish stability for the associated continuity equation, with total-variation error of order $h^{1/2}$ under mild moment assumptions. During training with fixed batch sampling across epochs, a Pontryagin-based adjoint analysis bounds deviations in the optimal cost and control, as well as in gradient-descent iterates. On the design side, we compare convergence rates for canonical batch sampling schemes, recover standard Bernoulli dropout as a special case, and derive a cost--accuracy trade-off yielding a closed-form optimal $h$. We then specialize to a single-layer neural ODE and validate the theory on classification and flow matching, observing the predicted rates, regularization effects, and favorable runtime and memory profiles.
comment: 37 pages, 20 figures
☆ Cluster-Based Client Selection for Dependent Multi-Task Federated Learning in Edge Computing
We study the client selection problem in Federated Learning (FL) within mobile edge computing (MEC) environments, particularly under the dependent multi-task settings, to reduce the total time required to complete various learning tasks. We propose CoDa-FL, a Cluster-oriented and Dependency-aware framework designed to reduce the total required time via cluster-based client selection and dependent task assignment. Our approach considers Earth Mover's Distance (EMD) for client clustering based on their local data distributions to lower computational cost and improve communication efficiency. We derive a direct and explicit relationship between intra-cluster EMD and the number of training rounds required for convergence, thereby simplifying the otherwise complex process of obtaining the optimal solution. Additionally, we incorporate a directed acyclic graph-based task scheduling mechanism to effectively manage task dependencies. Through numerical experiments, we validate that our proposed CoDa-FL outperforms existing benchmarks by achieving faster convergence, lower communication and computational costs, and higher learning accuracy under heterogeneous MEC settings.
comment: 6 pages
☆ On the Reasoning Abilities of Masked Diffusion Language Models
Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.
☆ Neural Triangular Transport Maps: A New Approach Towards Sampling in Lattice QCD
Lattice field theories are fundamental testbeds for computational physics; yet, sampling their Boltzmann distributions remains challenging due to multimodality and long-range correlations. While normalizing flows offer a promising alternative, their application to large lattices is often constrained by prohibitive memory requirements and the challenge of maintaining sufficient model expressivity. We propose sparse triangular transport maps that explicitly exploit the conditional independence structure of the lattice graph under periodic boundary conditions using monotone rectified neural networks (MRNN). We introduce a comprehensive framework for triangular transport maps that navigates the fundamental trade-off between \emph{exact sparsity} (respecting marginal conditional independence in the target distribution) and \emph{approximate sparsity} (computational tractability without fill-ins). Restricting each triangular map component to a local past enables site-wise parallel evaluation and linear time complexity in lattice size $N$, while preserving the expressive, invertible structure. Using $\phi^4$ in two dimensions as a controlled setting, we analyze how node labelings (orderings) affect the sparsity and performance of triangular maps. We compare against Hybrid Monte Carlo (HMC) and established flow approaches (RealNVP).
☆ ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models
Uncertainty Quantification (UQ) is a promising approach to improve model reliability, yet quantifying the uncertainty of Large Language Models (LLMs) is non-trivial. In this work, we establish a connection between the uncertainty of LLMs and their invariance under semantic-preserving intervention from a causal perspective. Building on this foundation, we propose a novel grey-box uncertainty quantification method that measures the variation in model outputs before and after the semantic-preserving intervention. Through theoretical justification, we show that our method provides an effective estimate of epistemic uncertainty. Our extensive experiments, conducted across various LLMs and a variety of question-answering (QA) datasets, demonstrate that our method excels not only in terms of effectiveness but also in computational efficiency.
Gaussian Certified Unlearning in High Dimensions: A Hypothesis Testing Approach
Machine unlearning seeks to efficiently remove the influence of selected data while preserving generalization. Significant progress has been made in low dimensions $(p \ll n)$, but high dimensions pose serious theoretical challenges as standard optimization assumptions of $\Omega(1)$ strong convexity and $O(1)$ smoothness of the per-example loss $f$ rarely hold simultaneously in proportional regimes $(p\sim n)$. In this work, we introduce $\varepsilon$-Gaussian certifiability, a canonical and robust notion well-suited to high-dimensional regimes, that optimally captures a broad class of noise adding mechanisms. Then we theoretically analyze the performance of a widely used unlearning algorithm based on one step of the Newton method in the high-dimensional setting described above. Our analysis shows that a single Newton step, followed by a well-calibrated Gaussian noise, is sufficient to achieve both privacy and accuracy in this setting. This result stands in sharp contrast to the only prior work that analyzes machine unlearning in high dimensions \citet{zou2025certified}, which relaxes some of the standard optimization assumptions for high-dimensional applicability, but operates under the notion of $\varepsilon$-certifiability. That work concludes %that a single Newton step is insufficient even for removing a single data point, and that at least two steps are required to ensure both privacy and accuracy. Our result leads us to conclude that the discrepancy in the number of steps arises because of the sub optimality of the notion of $\varepsilon$-certifiability and its incompatibility with noise adding mechanisms, which $\varepsilon$-Gaussian certifiability is able to overcome optimally.
comment: Comments welcome!
☆ A Multi-dimensional Semantic Surprise Framework Based on Low-Entropy Semantic Manifolds for Fine-Grained Out-of-Distribution Detection
Out-of-Distribution (OOD) detection is a cornerstone for the safe deployment of AI systems in the open world. However, existing methods treat OOD detection as a binary classification problem, a cognitive flattening that fails to distinguish between semantically close (Near-OOD) and distant (Far-OOD) unknown risks. This limitation poses a significant safety bottleneck in applications requiring fine-grained risk stratification. To address this, we propose a paradigm shift from a conventional probabilistic view to a principled information-theoretic framework. We formalize the core task as quantifying the Semantic Surprise of a new sample and introduce a novel ternary classification challenge: In-Distribution (ID) vs. Near-OOD vs. Far-OOD. The theoretical foundation of our work is the concept of Low-Entropy Semantic Manifolds, which are explicitly structured to reflect the data's intrinsic semantic hierarchy. To construct these manifolds, we design a Hierarchical Prototypical Network. We then introduce the Semantic Surprise Vector (SSV), a universal probe that decomposes a sample's total surprise into three complementary and interpretable dimensions: conformity, novelty, and ambiguity. To evaluate performance on this new task, we propose the Normalized Semantic Risk (nSR), a cost-sensitive metric. Experiments demonstrate that our framework not only establishes a new state-of-the-art (sota) on the challenging ternary task, but its robust representations also achieve top results on conventional binary benchmarks, reducing the False Positive Rate by over 60% on datasets like LSUN.
☆ DeepCausalMMM: A Deep Learning Framework for Marketing Mix Modeling with Causal Inference
Marketing Mix Modeling (MMM) is a statistical technique used to estimate the impact of marketing activities on business outcomes such as sales, revenue, or customer visits. Traditional MMM approaches often rely on linear regression or Bayesian hierarchical models that assume independence between marketing channels and struggle to capture complex temporal dynamics and non-linear saturation effects [@Hanssens2005; @Ng2021Bayesian]. DeepCausalMMM is a Python package that addresses these limitations by combining deep learning, causal inference, and advanced marketing science. The package uses Gated Recurrent Units (GRUs) to automatically learn temporal patterns such as adstock (carryover effects) and lag, while simultaneously learning statistical dependencies and potential causal structures between marketing channels through Directed Acyclic Graph (DAG) learning [@Zheng2018NOTEARS; @Gong2024CausalMMM]. Additionally, it implements Hill equation-based saturation curves to model diminishing returns and optimize budget allocation. Key innovations include: (1) a data-driven design where hyperparameters and transformations (e.g., adstock decay, saturation curves) are learned or estimated from data with sensible defaults, rather than requiring fixed heuristics or manual specification, (2) multi-region modeling with both shared and region-specific parameters, (3) robust statistical methods including Huber loss and advanced regularization, (4) comprehensive response curve analysis for understanding channel saturation, and (5) an extensive visualization suite with 14+ interactive dashboards for business insights.
comment: Submitted to JOSS (Journal of Open Source Software) Journal for Publishing. It's currently in the Pre-review stage. Please note that Author has no middle name. Last name is 'Puttaparthi Tirumala' (it's a two-part surname)
☆ GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models
Modern large language models leverage Mixture-of-Experts (MoE) architectures for efficient scaling, but face a critical challenge: functionally similar experts are often selected simultaneously, creating redundant computation and limiting effective model capacity. Existing auxiliary balance loss methods improve token distribution but fail to address the underlying expert diversity problem. We introduce GatePro, a novel parameter-free method that directly promotes expert selection diversity. GatePro identifies the most similar expert pairs and introduces localized competition mechanisms, preventing redundant expert co-activation while maintaining natural expert specialization. Our comprehensive evaluation demonstrates GatePro's effectiveness across model scales and benchmarks. Analysis demonstrates GatePro's ability to achieve enhanced expert diversity, where experts develop more distinct and complementary capabilities, avoiding functional redundancy. This approach can be deployed hot-swappable during any training phase without additional learnable parameters, offering a practical solution for improving MoE effectiveness.
☆ Transformer-based Scalable Beamforming Optimization via Deep Residual Learning
We develop an unsupervised deep learning framework for downlink beamforming in large-scale MU-MISO channels. The model is trained offline, allowing real-time inference through lightweight feedforward computations in dynamic communication environments. Following the learning-to-optimize (L2O) paradigm, a multi-layer Transformer iteratively refines both channel and beamformer features via residual connections. To enhance training, three strategies are introduced: (i) curriculum learning (CL) to improve early-stage convergence and avoid local optima, (ii) semi-amortized learning to refine each Transformer block with a few gradient ascent steps, and (iii) sliding-window training to stabilize optimization by training only a subset of Transformer blocks at a time. Extensive simulations show that the proposed scheme outperforms existing baselines at low-to-medium SNRs and closely approaches WMMSE performance at high SNRs, while achieving substantially faster inference than iterative and online learning approaches.
comment: 7 pages, 5 figures
☆ NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models
Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.
☆ Absolute indices for determining compactness, separability and number of clusters
Finding "true" clusters in a data set is a challenging problem. Clustering solutions obtained using different models and algorithms do not necessarily provide compact and well-separated clusters or the optimal number of clusters. Cluster validity indices are commonly applied to identify such clusters. Nevertheless, these indices are typically relative, and they are used to compare clustering algorithms or choose the parameters of a clustering algorithm. Moreover, the success of these indices depends on the underlying data structure. This paper introduces novel absolute cluster indices to determine both the compactness and separability of clusters. We define a compactness function for each cluster and a set of neighboring points for cluster pairs. This function is utilized to determine the compactness of each cluster and the whole cluster distribution. The set of neighboring points is used to define the margin between clusters and the overall distribution margin. The proposed compactness and separability indices are applied to identify the true number of clusters. Using a number of synthetic and real-world data sets, we demonstrate the performance of these new indices and compare them with other widely-used cluster validity indices.
comment: 25 pages, 11 figures, 9 tables
☆ True Self-Supervised Novel View Synthesis is Transferable
In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry -- such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.
☆ Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games
Reverse Kullback-Leibler (KL) divergence-based regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (using uniform reference policy, known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum Matrix games and Markov games: for Matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling and a novel concept of superoptimistic bonuses. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$ in addition to the standard $\widetilde{\mathcal{O}}(\sqrt{T})$ regret independent of $\beta$ which is attained in both regularized and unregularized settings
☆ AMORE: Adaptive Multi-Output Operator Network for Stiff Chemical Kinetics
Time integration of stiff systems is a primary source of computational cost in combustion, hypersonics, and other reactive transport systems. This stiffness can introduce time scales significantly smaller than those associated with other physical processes, requiring extremely small time steps in explicit schemes or computationally intensive implicit methods. Consequently, strategies to alleviate challenges posed by stiffness are important. While neural operators (DeepONets) can act as surrogates for stiff kinetics, a reliable operator learning strategy is required to appropriately account for differences in the error between output variables and samples. Here, we develop AMORE, Adaptive Multi-Output Operator Network, a framework comprising an operator capable of predicting multiple outputs and adaptive loss functions ensuring reliable operator learning. The operator predicts all thermochemical states from given initial conditions. We propose two adaptive loss functions within the framework, considering each state variable's and sample's error to penalize the loss function. We designed the trunk to automatically satisfy Partition of Unity. To enforce unity mass-fraction constraint exactly, we propose an invertible analytical map that transforms the $n$-dimensional species mass-fraction vector into an ($n-1$)-dimensional space, where DeepONet training is performed. We consider two-step training for DeepONet for multiple outputs and extend adaptive loss functions for trunk and branch training. We demonstrate the efficacy and applicability of our models through two examples: the syngas (12 states) and GRI-Mech 3.0 (24 active states out of 54). The proposed DeepONet will be a backbone for future CFD studies to accelerate turbulent combustion simulations. AMORE is a general framework, and here, in addition to DeepONet, we also demonstrate it for FNO.
☆ Reciprocal Space Attention for Learning Long-Range Interactions
Machine learning interatomic potentials (MLIPs) have revolutionized the modeling of materials and molecules by directly fitting to ab initio data. However, while these models excel at capturing local and semi-local interactions, they often prove insufficient when an explicit and efficient treatment of long-range interactions is required. To address this limitation, we introduce Reciprocal-Space Attention (RSA), a framework designed to capture long-range interactions in the Fourier domain. RSA can be integrated with any existing local or semi-local MLIP framework. The central contribution of this work is the mapping of a linear-scaling attention mechanism into Fourier space, enabling the explicit modeling of long-range interactions such as electrostatics and dispersion without relying on predefined charges or other empirical assumptions. We demonstrate the effectiveness of our method as a long-range correction to the MACE backbone across diverse benchmarks, including dimer binding curves, dispersion-dominated layered phosphorene exfoliation, and the molecular dipole density of bulk water. Our results show that RSA consistently captures long-range physics across a broad range of chemical and materials systems. The code and datasets for this work is available at https://github.com/rfhari/reciprocal_space_attention
comment: 13 pages including references with 6 figures and 1 table
☆ Time-Varying Optimization for Streaming Data Via Temporal Weighting
Classical optimization theory deals with fixed, time-invariant objective functions. However, time-varying optimization has emerged as an important subject for decision-making in dynamic environments. In this work, we study the problem of learning from streaming data through a time-varying optimization lens. Unlike prior works that focus on generic formulations, we introduce a structured, \emph{weight-based} formulation that explicitly captures the streaming-data origin of the time-varying objective, where at each time step, an agent aims to minimize a weighted average loss over all the past data samples. We focus on two specific weighting strategies: (1) uniform weights, which treat all samples equally, and (2) discounted weights, which geometrically decay the influence of older data. For both schemes, we derive tight bounds on the ``tracking error'' (TE), defined as the deviation between the model parameter and the time-varying optimum at a given time step, under gradient descent (GD) updates. We show that under uniform weighting, the TE vanishes asymptotically with a $\mathcal{O}(1/t)$ decay rate, whereas discounted weighting incurs a nonzero error floor controlled by the discount factor and the number of gradient updates performed at each time step. Our theoretical findings are validated through numerical simulations.
comment: Accepted at IEEE Asilomar, 2025
☆ An Operational Deep Learning System for Satellite-Based High-Resolution Global Nowcasting
Precipitation nowcasting, which predicts rainfall up to a few hours ahead, is a critical tool for vulnerable communities in the Global South frequently exposed to intense, rapidly developing storms. Timely forecasts provide a crucial window to protect lives and livelihoods. Traditional numerical weather prediction (NWP) methods suffer from high latency, low spatial and temporal resolution, and significant gaps in accuracy across the world. Recent machine learning-based nowcasting methods, common in the Global North, cannot be extended to the Global South due to extremely sparse radar coverage. We present Global MetNet, an operational global machine learning nowcasting model. It leverages the Global Precipitation Mission's CORRA dataset, geostationary satellite data, and global NWP data to predict precipitation for the next 12 hours. The model operates at a high resolution of approximately 0.05{\deg} (~5km) spatially and 15 minutes temporally. Global MetNet significantly outperforms industry-standard hourly forecasts and achieves significantly higher skill, making forecasts useful over a much larger area of the world than previously available. Our model demonstrates better skill in data-sparse regions than even the best high-resolution NWP models achieve in the US. Validated using ground radar and satellite data, it shows significant improvements across key metrics like the critical success index and fractions skill score for all precipitation rates and lead times. Crucially, our model generates forecasts in under a minute, making it readily deployable for real-time applications. It is already deployed for millions of users on Google Search. This work represents a key step in reducing global disparities in forecast quality and integrating sparse, high-resolution satellite observations into weather forecasting.
☆ Optimal Control Theoretic Neural Optimizer: From Backpropagation to Dynamic Programming
Optimization of deep neural networks (DNNs) has been a driving force in the advancement of modern machine learning and artificial intelligence. With DNNs characterized by a prolonged sequence of nonlinear propagation, determining their optimal parameters given an objective naturally fits within the framework of Optimal Control Programming. Such an interpretation of DNNs as dynamical systems has proven crucial in offering a theoretical foundation for principled analysis from numerical equations to physics. In parallel to these theoretical pursuits, this paper focuses on an algorithmic perspective. Our motivated observation is the striking algorithmic resemblance between the Backpropagation algorithm for computing gradients in DNNs and the optimality conditions for dynamical systems, expressed through another backward process known as dynamic programming. Consolidating this connection, where Backpropagation admits a variational structure, solving an approximate dynamic programming up to the first-order expansion leads to a new class of optimization methods exploring higher-order expansions of the Bellman equation. The resulting optimizer, termed Optimal Control Theoretic Neural Optimizer (OCNOpt), enables rich algorithmic opportunities, including layer-wise feedback policies, game-theoretic applications, and higher-order training of continuous-time models such as Neural ODEs. Extensive experiments demonstrate that OCNOpt improves upon existing methods in robustness and efficiency while maintaining manageable computational complexity, paving new avenues for principled algorithmic design grounded in dynamical systems and optimal control theory.
☆ Towards Reversible Model Merging For Low-rank Weights
Model merging aims to combine multiple fine-tuned models into a single set of weights that performs well across all source tasks. While prior work has shown that merging can approximate the performance of individual fine-tuned models for each task, it largely overlooks scenarios where models are compressed into low-rank representations, either through low-rank adaptation (LoRA) or post-training singular value decomposition (SVD). We first demonstrate that applying conventional merging methods to low-rank weights leads to severe performance degradation in the merged model. Motivated by this phenomenon, we propose a fundamentally different approach: instead of collapsing all adapters into one set of weights, we construct a compact basis (e.g., an equivalent of holding two or more models) from which original task-specific models can be recovered via linear combination. This reframes merging as generating a reconstruction-capable model space rather than producing a single merged model. Crucially, this allows us to ``revert'' to each individual model when needed, recognizing that no merged model can consistently outperform one specialized for its task. Building on this insight, we introduce our method, Reversible Model Merging (RMM), an efficient, data-free, and flexible method that provides a closed-form solution for selecting the optimal basis of model weights and task-specific coefficients for linear combination. Extensive experiments across diverse datasets and model scales demonstrate that RMM consistently outperforms existing merging approaches, preserving the performance of low-rank compressed models by a significant margin.
☆ Data Understanding Survey: Pursuing Improved Dataset Characterization Via Tensor-based Methods
In the evolving domains of Machine Learning and Data Analytics, existing dataset characterization methods such as statistical, structural, and model-based analyses often fail to deliver the deep understanding and insights essential for innovation and explainability. This work surveys the current state-of-the-art conventional data analytic techniques and examines their limitations, and discusses a variety of tensor-based methods and how these may provide a more robust alternative to traditional statistical, structural, and model-based dataset characterization techniques. Through examples, we illustrate how tensor methods unveil nuanced data characteristics, offering enhanced interpretability and actionable intelligence. We advocate for the adoption of tensor-based characterization, promising a leap forward in understanding complex datasets and paving the way for intelligent, explainable data-driven discoveries.
comment: 20 pages, 8 figures, Pre-print
☆ On Evaluating Loss Functions for Stock Ranking: An Empirical Analysis With Transformer Model CIKM 2025
Quantitative trading strategies rely on accurately ranking stocks to identify profitable investments. Effective portfolio management requires models that can reliably order future stock returns. Transformer models are promising for understanding financial time series, but how different training loss functions affect their ability to rank stocks well is not yet fully understood. Financial markets are challenging due to their changing nature and complex relationships between stocks. Standard loss functions, which aim for simple prediction accuracy, often aren't enough. They don't directly teach models to learn the correct order of stock returns. While many advanced ranking losses exist from fields such as information retrieval, there hasn't been a thorough comparison to see how well they work for ranking financial returns, especially when used with modern Transformer models for stock selection. This paper addresses this gap by systematically evaluating a diverse set of advanced loss functions including pointwise, pairwise, listwise for daily stock return forecasting to facilitate rank-based portfolio selection on S&P 500 data. We focus on assessing how each loss function influences the model's ability to discern profitable relative orderings among assets. Our research contributes a comprehensive benchmark revealing how different loss functions impact a model's ability to learn cross-sectional and temporal patterns crucial for portfolio selection, thereby offering practical guidance for optimizing ranking-based trading strategies.
comment: This paper has been submitted to CIKM 2025
☆ Combining Reinforcement Learning and Behavior Trees for NPCs in Video Games with AMD Schola
While the rapid advancements in the reinforcement learning (RL) research community have been remarkable, the adoption in commercial video games remains slow. In this paper, we outline common challenges the Game AI community faces when using RL-driven NPCs in practice, and highlight the intersection of RL with traditional behavior trees (BTs) as a crucial juncture to be explored further. Although the BT+RL intersection has been suggested in several research papers, its adoption is rare. We demonstrate the viability of this approach using AMD Schola -- a plugin for training RL agents in Unreal Engine -- by creating multi-task NPCs in a complex 3D environment inspired by the commercial video game ``The Last of Us". We provide detailed methodologies for jointly training RL models with BTs while showcasing various skills.
comment: 8 pages, 4 figures, 5 tables
☆ CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization
In this work, we introduce CodeEvolve, an open-source evolutionary coding agent that unites Large Language Models (LLMs) with genetic algorithms to solve complex computational problems. Our framework adapts powerful evolutionary concepts to the LLM domain, building upon recent methods for generalized scientific discovery. CodeEvolve employs an island-based genetic algorithm to maintain population diversity and increase throughput, introduces a novel inspiration-based crossover mechanism that leverages the LLMs context window to combine features from successful solutions, and implements meta-prompting strategies for dynamic exploration of the solution space. We conduct a rigorous evaluation of CodeEvolve on a subset of the mathematical benchmarks used to evaluate Google DeepMind's closed-source AlphaEvolve. Our findings show that our method surpasses AlphaEvolve's performance on several challenging problems. To foster collaboration and accelerate progress, we release our complete framework as an open-source repository.
comment: 11 pages, 9 figures, 2 tables
☆ PoissonNet: A Local-Global Approach for Learning on Surfaces SIGGRAPH
Many network architectures exist for learning on meshes, yet their constructions entail delicate trade-offs between difficulty learning high-frequency features, insufficient receptive field, sensitivity to discretization, and inefficient computational overhead. Drawing from classic local-global approaches in mesh processing, we introduce PoissonNet, a novel neural architecture that overcomes all of these deficiencies by formulating a local-global learning scheme, which uses Poisson's equation as the primary mechanism for feature propagation. Our core network block is simple; we apply learned local feature transformations in the gradient domain of the mesh, then solve a Poisson system to propagate scalar feature updates across the surface globally. Our local-global learning framework preserves the features's full frequency spectrum and provides a truly global receptive field, while remaining agnostic to mesh triangulation. Our construction is efficient, requiring far less compute overhead than comparable methods, which enables scalability -- both in the size of our datasets, and the size of individual training samples. These qualities are validated on various experiments where, compared to previous intrinsic architectures, we attain state-of-the-art performance on semantic segmentation and parameterizing highly-detailed animated surfaces. Finally, as a central application of PoissonNet, we show its ability to learn deformations, significantly outperforming state-of-the-art architectures that learn on surfaces.
comment: In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 2025, 16 pages
☆ High-Dimensional BWDM: A Robust Nonparametric Clustering Validation Index for Large-Scale Data
Determining the appropriate number of clusters in unsupervised learning is a central problem in statistics and data science. Traditional validity indices such as Calinski-Harabasz, Silhouette, and Davies-Bouldin-depend on centroid-based distances and therefore degrade in high-dimensional or contaminated data. This paper proposes a new robust, nonparametric clustering validation framework, the High-Dimensional Between-Within Distance Median (HD-BWDM), which extends the recently introduced BWDM criterion to high-dimensional spaces. HD-BWDM integrates random projection and principal component analysis to mitigate the curse of dimensionality and applies trimmed clustering and medoid-based distances to ensure robustness against outliers. We derive theoretical results showing consistency and convergence under Johnson-Lindenstrauss embeddings. Extensive simulations demonstrate that HD-BWDM remains stable and interpretable under high-dimensional projections and contamination, providing a robust alternative to traditional centroid-based validation criteria. The proposed method provides a theoretically grounded, computationally efficient stopping rule for nonparametric clustering in modern high-dimensional applications.
☆ Inferred global dense residue transition graphs from primary structure sequences enable protein interaction prediction via directed graph convolutional neural networks
Introduction Accurate prediction of protein-protein interactions (PPIs) is crucial for understanding cellular functions and advancing drug development. Existing in-silico methods use direct sequence embeddings from Protein Language Models (PLMs). Others use Graph Neural Networks (GNNs) for 3D protein structures. This study explores less computationally intensive alternatives. We introduce a novel framework for downstream PPI prediction through link prediction. Methods We introduce a two-stage graph representation learning framework, ProtGram-DirectGCN. First, we developed ProtGram. This approach models a protein's primary structure as a hierarchy of globally inferred n-gram graphs. In these graphs, residue transition probabilities define edge weights. Each edge connects a pair of residues in a directed graph. The probabilities are aggregated from a large corpus of sequences. Second, we propose DirectGCN, a custom directed graph convolutional neural network. This model features a unique convolutional layer. It processes information through separate path-specific transformations: incoming, outgoing, and undirected. A shared transformation is also applied. These paths are combined via a learnable gating mechanism. We apply DirectGCN to ProtGram graphs to learn residue-level embeddings. These embeddings are pooled via attention to generate protein-level embeddings for prediction. Results We first established the efficacy of DirectGCN on standard node classification benchmarks. Its performance matches established methods on general datasets. The model excels at complex, directed graphs with dense, heterophilic structures. When applied to PPI prediction, the full ProtGram-DirectGCN framework delivers robust predictive power. This strong performance holds even with limited training data.
comment: under review in Frontiers in Bioinformatics
☆ Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks
The p-persistent CSMA protocol is central to random-access MAC analysis, but predicting saturation throughput in heterogeneous multi-hop wireless networks remains a hard problem. Simplified models that assume a single, shared interference domain can underestimate throughput by 48--62\% in sparse topologies. Exact Markov-chain analyses are accurate but scale exponentially in computation time, making them impractical for large networks. These computational barriers motivate structural machine learning approaches like GNNs for scalable throughput prediction in general network topologies. Yet off-the-shelf GNNs struggle here: a standard GCN yields 63.94\% normalized mean absolute error (NMAE) on heterogeneous networks because symmetric normalization conflates a node's direct interference with higher-order, cascading effects that pertain to how interference propagates over the network graph. Building on these insights, we propose the Decoupled Graph Convolutional Network (D-GCN), a novel architecture that explicitly separates processing of a node's own transmission probability from neighbor interference effects. D-GCN replaces mean aggregation with learnable attention, yielding interpretable, per-neighbor contribution weights while capturing complex multihop interference patterns. D-GCN attains 3.3\% NMAE, outperforms strong baselines, remains tractable even when exact analytical methods become computationally infeasible, and enables gradient-based network optimization that achieves within 1\% of theoretical optima.
☆ Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL
In this work, we take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning. We study Single-Goal Contrastive Reinforcement Learning (SGCRL), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula. We combine theoretical analysis of the algorithm's objective function with controlled experiments to understand what drives its exploration. We show that SGCRL maximizes implicit rewards shaped by its learned representations. These representations automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter. Our experiments also demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.
☆ Neural Network-enabled Domain-consistent Robust Optimisation for Global CO$_2$ Reduction Potential of Gas Power Plants
We introduce a neural network-driven robust optimisation framework that integrates data-driven domain as a constraint into the nonlinear programming technique, addressing the overlooked issue of domain-inconsistent solutions arising from the interaction of parametrised neural network models with optimisation solvers. Applied to a 1180 MW capacity combined cycle gas power plant, our framework delivers domain-consistent robust optimal solutions that achieve a verified 0.76 percentage point mean improvement in energy efficiency. For the first time, scaling this efficiency gain to the global fleet of gas power plants, we estimate an annual 26 Mt reduction potential in CO$_2$ (with 10.6 Mt in Asia, 9.0 Mt in the Americas, and 4.5 Mt in Europe). These results underscore the synergetic role of machine learning in delivering near-term, scalable decarbonisation pathways for global climate action.
☆ David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation
Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) for CARLA, enables precise and reproducible scenarios, but NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them with an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (F1 of EDIT-SIM and compilation) as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5Coder-14B reaches about 88 percent of its expert score on local hardware. Retrieval-augmented prompting, Few-Shot with Example Retriever (FSER), consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.
☆ Briding Diffusion Posterior Sampling and Monte Carlo methods: a survey
Diffusion models enable the synthesis of highly accurate samples from complex distributions and have become foundational in generative modeling. Recently, they have demonstrated significant potential for solving Bayesian inverse problems by serving as priors. This review offers a comprehensive overview of current methods that leverage \emph{pre-trained} diffusion models alongside Monte Carlo methods to address Bayesian inverse problems without requiring additional training. We show that these methods primarily employ a \emph{twisting} mechanism for the intermediate distributions within the diffusion process, guiding the simulations toward the posterior distribution. We describe how various Monte Carlo methods are then used to aid in sampling from these twisted distributions.
☆ Extracting latent representations from X-ray spectra. Classification, regression, and accretion signatures of Chandra sources
The study of X-ray spectra is crucial to understanding the physical nature of astrophysical sources. Machine learning methods can extract compact and informative representations of data from large datasets. The Chandra Source Catalog (CSC) provides a rich archive of X-ray spectral data, which remains largely underexplored in this context. This work aims to develop a compact and physically meaningful representation of Chandra X-ray spectra using deep learning. To verify that the learned representation captures relevant information, we evaluate it through classification, regression, and interpretability analyses. We use a transformer-based autoencoder to compress X-ray spectra. The input spectra, drawn from the CSC, include only high-significance detections. Astrophysical source types and physical summary statistics are compiled from external catalogs. We evaluate the learned representation in terms of spectral reconstruction accuracy, clustering performance on 8 known astrophysical source classes, and correlation with physical quantities such as hardness ratios and hydrogen column density ($N_H$). The autoencoder accurately reconstructs spectra with 8 latent variables. Clustering in the latent space yields a balanced classification accuracy of $\sim$40% across the 8 source classes, increasing to $\sim$69% when restricted to AGNs and stellar-mass compact objects exclusively. Moreover, latent features correlate with non-linear combinations of spectral fluxes, suggesting that the compressed representation encodes physically relevant information. The proposed autoencoder-based pipeline is a powerful tool for the representation and interpretation of X-ray spectra, providing a compact latent space that supports both classification and the estimation of physical properties. This work demonstrates the potential of deep learning for spectral studies and uncovering new patterns in X-ray data.
☆ Near-Optimal Regret-Queue Length Tradeoff in Online Learning for Two-Sided Markets
We study a two-sided market, wherein, price-sensitive heterogeneous customers and servers arrive and join their respective queues. A compatible customer-server pair can then be matched by the platform, at which point, they leave the system. Our objective is to design pricing and matching algorithms that maximize the platform's profit, while maintaining reasonable queue lengths. As the demand and supply curves governing the price-dependent arrival rates may not be known in practice, we design a novel online-learning-based pricing policy and establish its near-optimality. In particular, we prove a tradeoff among three performance metrics: $\tilde{O}(T^{1-\gamma})$ regret, $\tilde{O}(T^{\gamma/2})$ average queue length, and $\tilde{O}(T^{\gamma})$ maximum queue length for $\gamma \in (0, 1/6]$, significantly improving over existing results [1]. Moreover, barring the permissible range of $\gamma$, we show that this trade-off between regret and average queue length is optimal up to logarithmic factors under a class of policies, matching the optimal one as in [2] which assumes the demand and supply curves to be known. Our proposed policy has two noteworthy features: a dynamic component that optimizes the tradeoff between low regret and small queue lengths; and a probabilistic component that resolves the tension between obtaining useful samples for fast learning and maintaining small queue lengths.
comment: 67 pages, 12 figures
☆ TENDE: Transfer Entropy Neural Diffusion Estimation
Transfer entropy measures directed information flow in time series, and it has become a fundamental quantity in applications spanning neuroscience, finance, and complex systems analysis. However, existing estimation methods suffer from the curse of dimensionality, require restrictive distributional assumptions, or need exponentially large datasets for reliable convergence. We address these limitations in the literature by proposing TENDE (Transfer Entropy Neural Diffusion Estimation), a novel approach that leverages score-based diffusion models to estimate transfer entropy through conditional mutual information. By learning score functions of the relevant conditional distributions, TENDE provides flexible, scalable estimation while making minimal assumptions about the underlying data-generating process. We demonstrate superior accuracy and robustness compared to existing neural estimators and other state-of-the-art approaches across synthetic benchmarks and real data.
☆ Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning
Systematic, compositional generalization beyond the training distribution remains a core challenge in machine learning -- and a critical bottleneck for the emergent reasoning abilities of modern language models. This work investigates out-of-distribution (OOD) generalization in Transformer networks using a GSM8K-style modular arithmetic on computational graphs task as a testbed. We introduce and explore a set of four architectural mechanisms aimed at enhancing OOD generalization: (i) input-adaptive recurrence; (ii) algorithmic supervision; (iii) anchored latent representations via a discrete bottleneck; and (iv) an explicit error-correction mechanism. Collectively, these mechanisms yield an architectural approach for native and scalable latent space reasoning in Transformer networks with robust algorithmic generalization capabilities. We complement these empirical results with a detailed mechanistic interpretability analysis that reveals how these mechanisms give rise to robust OOD generalization abilities.
☆ Neural Network approximation power on homogeneous and heterogeneous reaction-diffusion equations
Reaction-diffusion systems represent one of the most fundamental formulations used to describe a wide range of physical, chemical, and biological processes. With the increasing adoption of neural networks, recent research has focused on solving differential equations using machine learning techniques. However, the theoretical foundation explaining why neural networks can effectively approximate such solutions remains insufficiently explored. This paper provides a theoretical analysis of the approximation power of neural networks for one- and two-dimensional reaction-diffusion equations in both homogeneous and heterogeneous media. Building upon the universal approximation theorem, we demonstrate that a two-layer neural network can approximate the one-dimensional reaction-diffusion equation, while a three-layer neural network can approximate its two-dimensional counterpart. The theoretical framework presented here can be further extended to elliptic and parabolic equations. Overall, this work highlights the expressive power of neural networks in approximating solutions to reaction-diffusion equations and related PDEs, providing a theoretical foundation for neural network-based differential equation solvers.
☆ deFOREST: Fusing Optical and Radar satellite data for Enhanced Sensing of Tree-loss
In this paper we develop a deforestation detection pipeline that incorporates optical and Synthetic Aperture Radar (SAR) data. A crucial component of the pipeline is the construction of anomaly maps of the optical data, which is done using the residual space of a discrete Karhunen-Lo\`{e}ve (KL) expansion. Anomalies are quantified using a concentration bound on the distribution of the residual components for the nominal state of the forest. This bound does not require prior knowledge on the distribution of the data. This is in contrast to statistical parametric methods that assume knowledge of the data distribution, an impractical assumption that is especially infeasible for high dimensional data such as ours. Once the optical anomaly maps are computed they are combined with SAR data, and the state of the forest is classified by using a Hidden Markov Model (HMM). We test our approach with Sentinel-1 (SAR) and Sentinel-2 (Optical) data on a $92.19\,km \times 91.80\,km$ region in the Amazon forest. The results show that both the hybrid optical-radar and optical only methods achieve high accuracy that is superior to the recent state-of-the-art hybrid method. Moreover, the hybrid method is significantly more robust in the case of sparse optical data that are common in highly cloudy regions.
☆ Exact Dynamics of Multi-class Stochastic Gradient Descent
We develop a framework for analyzing the training and learning rate dynamics on a variety of high- dimensional optimization problems trained using one-pass stochastic gradient descent (SGD) with data generated from multiple anisotropic classes. We give exact expressions for a large class of functions of the limiting dynamics, including the risk and the overlap with the true signal, in terms of a deterministic solution to a system of ODEs. We extend the existing theory of high-dimensional SGD dynamics to Gaussian-mixture data and a large (growing with the parameter size) number of classes. We then investigate in detail the effect of the anisotropic structure of the covariance of the data in the problems of binary logistic regression and least square loss. We study three cases: isotropic covariances, data covariance matrices with a large fraction of zero eigenvalues (denoted as the zero-one model), and covariance matrices with spectra following a power-law distribution. We show that there exists a structural phase transition. In particular, we demonstrate that, for the zero-one model and the power-law model with sufficiently large power, SGD tends to align more closely with values of the class mean that are projected onto the "clean directions" (i.e., directions of smaller variance). This is supported by both numerical simulations and analytical studies, which show the exact asymptotic behavior of the loss in the high-dimensional limit.
comment: 58 pages, 12 figures
☆ Exploratory Causal Inference in SAEnce
Randomized Controlled Trials are one of the pillars of science; nevertheless, they rely on hand-crafted hypotheses and expensive analysis. Such constraints prevent causal effect estimation at scale, potentially anchoring on popular yet incomplete hypotheses. We propose to discover the unknown effects of a treatment directly from data. For this, we turn unstructured data from a trial into meaningful representations via pretrained foundation models and interpret them via a sparse autoencoder. However, discovering significant causal effects at the neural level is not trivial due to multiple-testing issues and effects entanglement. To address these challenges, we introduce Neural Effect Search, a novel recursive procedure solving both issues by progressive stratification. After assessing the robustness of our algorithm on semi-synthetic experiments, we showcase, in the context of experimental ecology, the first successful unsupervised causal effect identification on a real-world scientific trial.
☆ On the expressivity of sparse maxout networks
We study the expressivity of sparse maxout networks, where each neuron takes a fixed number of inputs from the previous layer and employs a, possibly multi-argument, maxout activation. This setting captures key characteristics of convolutional or graph neural networks. We establish a duality between functions computable by such networks and a class of virtual polytopes, linking their geometry to questions of network expressivity. In particular, we derive a tight bound on the dimension of the associated polytopes, which serves as the central tool for our analysis. Building on this, we construct a sequence of depth hierarchies. While sufficiently deep sparse maxout networks are universal, we prove that if the required depth is not reached, width alone cannot compensate for the sparsity of a fixed indegree constraint.
☆ FedHFT: Efficient Federated Finetuning with Heterogeneous Edge Clients
Fine-tuning pre-trained large language models (LLMs) has become a common practice for personalized natural language understanding (NLU) applications on downstream tasks and domain-specific datasets. However, there are two main challenges: (i) limited and/or heterogeneous data for fine-tuning due to proprietary data confidentiality or privacy requirements, and (ii) varying computation resources available across participating clients such as edge devices. This paper presents FedHFT - an efficient and personalized federated fine-tuning framework to address both challenges. First, we introduce a mixture of masked adapters to handle resource heterogeneity across participating clients, enabling high-performance collaborative fine-tuning of pre-trained language model(s) across multiple clients in a distributed setting, while keeping proprietary data local. Second, we introduce a bi-level optimization approach to handle non-iid data distribution based on masked personalization and client clustering. Extensive experiments demonstrate significant performance and efficiency improvements over various natural language understanding tasks under data and resource heterogeneity compared to representative heterogeneous federated learning methods.
☆ CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations
Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Welcome to visit our: Project page:https://causal-verse.github.io/, Dataset:https://huggingface.co/CausalVerse.
☆ Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games
Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply "out-of-the-box thinking" to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.
comment: EMNLP Main 2025
☆ Context-Selective State Space Models: Feedback is All You Need
Transformers, powered by the attention mechanism, are the backbone of most foundation models, yet they suffer from quadratic complexity and difficulties in dealing with long-range dependencies in the input sequence. Recent work has shown that state space models (SSMs) provide an efficient alternative, with the S6 module at the core of the Mamba architecture achieving state-of-the-art results on long-sequence benchmarks. In this paper, we introduce the COFFEE (COntext From FEEdback) model, a novel time-varying SSM that incorporates state feedback to enable context-dependent selectivity, while still allowing for parallel implementation. Whereas the selectivity mechanism of S6 only depends on the current input, COFFEE computes it from the internal state, which serves as a compact representation of the sequence history. This shift allows the model to regulate its dynamics based on accumulated context, improving its ability to capture long-range dependencies. In addition to state feedback, we employ an efficient model parametrization that removes redundancies present in S6 and leads to a more compact and trainable formulation. On the induction head task, COFFEE achieves near-perfect accuracy with two orders of magnitude fewer parameters and training sequences compared to S6. On MNIST, COFFEE largely outperforms S6 within the same architecture, reaching 97% accuracy with only 3585 parameters. These results showcase the role of state feedback as a key mechanism for building scalable and efficient sequence models.
☆ Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training
Geometry-aware optimization algorithms, such as Muon, have achieved remarkable success in training deep neural networks (DNNs). These methods leverage the underlying geometry of DNNs by selecting appropriate norms for different layers and updating parameters via norm-constrained linear minimization oracles (LMOs). However, even within a group of layers associated with the same norm, the local curvature can be heterogeneous across layers and vary dynamically over the course of training. For example, recent work shows that sharpness varies substantially across transformer layers and throughout training, yet standard geometry-aware optimizers impose fixed learning rates to layers within the same group, which may be inefficient for DNN training. In this paper, we introduce a noise-adaptive layerwise learning rate scheme on top of geometry-aware optimization algorithms and substantially accelerate DNN training compared to methods that use fixed learning rates within each group. Our method estimates gradient variance in the dual norm induced by the chosen LMO on the fly, and uses it to assign time-varying noise-adaptive layerwise learning rates within each group. We provide a theoretical analysis showing that our algorithm achieves a sharp convergence rate. Empirical results on transformer architectures such as LLaMA and GPT demonstrate that our approach achieves faster convergence than state-of-the-art optimizers.
☆ Conditional Clifford-Steerable CNNs with Complete Kernel Basis for PDE Modeling
Clifford-Steerable CNNs (CSCNNs) provide a unified framework that allows incorporating equivariance to arbitrary pseudo-Euclidean groups, including isometries of Euclidean space and Minkowski spacetime. In this work, we demonstrate that the kernel basis of CSCNNs is not complete, thus limiting the model expressivity. To address this issue, we propose Conditional Clifford-Steerable Kernels, which augment the kernels with equivariant representations computed from the input feature field. We derive the equivariance constraint for these input-dependent kernels and show how it can be solved efficiently via implicit parameterization. We empirically demonstrate an improved expressivity of the resulting framework on multiple PDE forecasting tasks, including fluid dynamics and relativistic electrodynamics, where our method consistently outperforms baseline methods.
☆ PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features
LLM-integrated applications are vulnerable to prompt injection attacks, where an attacker contaminates the input to inject malicious prompts, causing the LLM to follow the attacker's intent instead of the original user's. Existing prompt injection detection methods often have sub-optimal performance and/or high computational overhead. In this work, we propose PIShield, a detection method that is both effective and efficient. Our key observation is that the internal representation of the final token in a prompt-extracted from a specific layer of the LLM, which we term the injection-critical layer-captures distinguishing features between clean and contaminated prompts. Leveraging this insight, we train a simple linear classifier on these internal representations using a labeled set of clean and contaminated prompts. We compare PIShield against 11 baselines across 5 diverse benchmark datasets and 8 prompt injection attacks. The results demonstrate that PIShield is both highly effective and efficient, substantially outperforming existing methods. Additionally, we show that PIShield resists strong adaptive attacks.
comment: The code is available at https://github.com/weizou52/PIShield
☆ REAP the Experts: Why Pruning Prevails for One-Shot MoE compression
Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we demonstrate that expert pruning is a superior strategy for generative tasks. We prove that merging introduces an irreducible error by causing a "functional subspace collapse", due to the loss of the router's independent, input-dependent control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
comment: 26 pages, 8 figures, 7 tables
☆ BitNet Distillation
In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between finetuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model size, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.
comment: 12 pages, 4 figures
☆ Dynamic SBI: Round-free Sequential Simulation-Based Inference with Adaptive Datasets
Simulation-based inference (SBI) is emerging as a new statistical paradigm for addressing complex scientific inference problems. By leveraging the representational power of deep neural networks, SBI can extract the most informative simulation features for the parameters of interest. Sequential SBI methods extend this approach by iteratively steering the simulation process towards the most relevant regions of parameter space. This is typically implemented through an algorithmic structure, in which simulation and network training alternate over multiple rounds. This strategy is particularly well suited for high-precision inference in high-dimensional settings, which are commonplace in physics applications with growing data volumes and increasing model fidelity. Here, we introduce dynamic SBI, which implements the core ideas of sequential methods in a round-free, asynchronous, and highly parallelisable manner. At its core is an adaptive dataset that is iteratively transformed during inference to resemble the target observation. Simulation and training proceed in parallel: trained networks are used both to filter out simulations incompatible with the data and to propose new, more promising ones. Compared to round-based sequential methods, this asynchronous structure can significantly reduce simulation costs and training overhead. We demonstrate that dynamic SBI achieves significant improvements in simulation and training efficiency while maintaining inference performance. We further validate our framework on two challenging astrophysical inference tasks: characterising the stochastic gravitational wave background and analysing strong gravitational lensing systems. Overall, this work presents a flexible and efficient new paradigm for sequential SBI.
comment: 15 pages, 5 figures, software available at: https://github.com/dynamic-sbi
Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models
Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
comment: 11 pages, 7 figures, 8 tables. To be published in Applied AI Letters
☆ Signature in Code Backdoor Detection, how far are we?
As Large Language Models (LLMs) become increasingly integrated into software development workflows, they also become prime targets for adversarial attacks. Among these, backdoor attacks are a significant threat, allowing attackers to manipulate model outputs through hidden triggers embedded in training data. Detecting such backdoors remains a challenge, and one promising approach is the use of Spectral Signature defense methods that identify poisoned data by analyzing feature representations through eigenvectors. While some prior works have explored Spectral Signatures for backdoor detection in neural networks, recent studies suggest that these methods may not be optimally effective for code models. In this paper, we revisit the applicability of Spectral Signature-based defenses in the context of backdoor attacks on code models. We systematically evaluate their effectiveness under various attack scenarios and defense configurations, analyzing their strengths and limitations. We found that the widely used setting of Spectral Signature in code backdoor detection is often suboptimal. Hence, we explored the impact of different settings of the key factors. We discovered a new proxy metric that can more accurately estimate the actual performance of Spectral Signature without model retraining after the defense.
comment: 20 pages, 3 figures
☆ Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems
Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.
comment: 8 pages
☆ Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems ICLR 2025
Recovering true signals from noisy measurements is a central challenge in inverse problems spanning medical imaging, geophysics, and signal processing. Current solutions balance prior assumptions regarding the true signal (regularization) with agreement to noisy measured data (data-fidelity). Conventional data-fidelity loss functions, such as mean-squared error (MSE) or negative log-likelihood, seek pointwise agreement with noisy measurements, often leading to overfitting to noise. In this work, we instead evaluate data-fidelity collectively by testing whether the observed measurements are statistically consistent with the noise distributions implied by the current estimate. We adopt this aggregated perspective and introduce distributional consistency (DC) loss, a data-fidelity objective that replaces pointwise matching with distribution-level calibration using model-based probability scores for each measurement. DC loss acts as a direct and practical plug-in replacement for standard data consistency terms: i) it is compatible with modern regularizers, ii) it is optimized in the same way as traditional losses, and iii) it avoids overfitting to measurement noise even without the use of priors. Its scope naturally fits many practical inverse problems where the measurement-noise distribution is known and where the measured dataset consists of many independent noisy values. We demonstrate efficacy in two key example application areas: i) in image denoising with deep image prior, using DC instead of MSE loss removes the need for early stopping and achieves higher PSNR; ii) in medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances the efficacy of hand-crafted regularization. These results position DC loss as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for inverse problems.
comment: Preprint; submitted to ICLR 2025 for possible publication
♻ ☆ EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels
Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at https://github.com/KPeng9510/ERELIFM.
comment: The source code is available at https://github.com/KPeng9510/ERELIFM
♻ ☆ Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce fine-resolution maps that support decision-making in long-term water management.
LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
Large Language Models (LLMs) exhibit enhanced capabilities by Chain-of-Thought reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens regain high attention after multiple decoding steps, which is failed to capture by existing works and may lead to unpredictable eviction on such periodically critical tokens. To address this, we propose LazyEviction, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache by 50%~70% while maintaining comparable accuracy, outperforming existing KV cache compression baselines. Our implementation code can be found at https://github.com/Halo-949/LazyEviction.
♻ ☆ BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals NeurIPS 2025
Electroencephalography (EEG) and magnetoencephalography (MEG) measure neural activity non-invasively by capturing electromagnetic fields generated by dendritic currents. Although rooted in the same biophysics, EEG and MEG exhibit distinct signal patterns, further complicated by variations in sensor configurations across modalities and recording devices. Existing approaches typically rely on separate, modality- and dataset-specific models, which limits the performance and cross-domain scalability. This paper proposes BrainOmni, the first brain foundation model that generalises across heterogeneous EEG and MEG recordings. To unify diverse data sources, we introduce BrainTokenizer,the first tokenizer that quantises spatiotemporal brain activity into discrete representations. Central to BrainTokenizer is a novel Sensor Encoder that encodes sensor properties such as spatial layout, orientation, and type, enabling compatibility across devices and modalities. Building upon the discrete representations, BrainOmni learns unified semantic embeddings of brain signals by self-supervised pretraining. To the best of our knowledge, it is the first foundation model to support both EEG and MEG signals, as well as the first to incorporate large-scale MEG pretraining. A total of 1,997 hours of EEG and 656 hours of MEG data are curated and standardised from publicly available sources for pretraining. Experiments show that BrainOmni outperforms both existing foundation models and state-of-the-art task-specific models on a range of downstream tasks. It also demonstrates strong generalisation to unseen EEG and MEG devices. Further analysis reveals that joint EEG-MEG (EMEG) training yields consistent improvements across both modalities. Code and checkpoints are publicly available at https://github.com/OpenTSLab/BrainOmni.
comment: Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Tensor Logic: The Language of AI
Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP an Prolog lack scalability and support for learning. This paper proposes tensor logic, a language that solves these problems by unifying neural and symbolic AI at a fundamental level. The sole construct in tensor logic is the tensor equation, based on the observation that logical rules and Einstein summation are essentially the same operation, and all else can be reduced to them. I show how to elegantly implement key forms of neural, symbolic and statistical AI in tensor logic, including transformers, formal reasoning, kernel machines and graphical models. Most importantly, tensor logic makes new directions possible, such as sound reasoning in embedding space. This combines the scalability and learnability of neural networks with the reliability and transparency of symbolic reasoning, and is potentially a basis for the wider adoption of AI.
comment: 17 pages, 0 figures
♻ ☆ Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series NeurIPS 2025
Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://github.com/blacksnail789521/Time-IMM, and the benchmark library can be accessed at https://github.com/blacksnail789521/IMM-TSF. Project page: https://blacksnail789521.github.io/time-imm-project-page/
comment: This paper has been accepted by the NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critics role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.
♻ ☆ Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning
Mamba, a recently proposed linear-time sequence model, has attracted significant attention for its computational efficiency and strong empirical performance. However, a rigorous theoretical understanding of its underlying mechanisms remains limited. In this work, we provide a theoretical analysis of Mamba's in-context learning (ICL) capability by focusing on tasks defined by low-dimensional nonlinear target functions. Specifically, we study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$, which depends on only a single relevant direction $\boldsymbol{\beta}$, referred to as feature. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning, extracting the relevant direction directly from context examples. Consequently, we establish a test-time sample complexity that improves upon linear Transformers -- analyzed to behave like kernel methods -- and is comparable to nonlinear Transformers, which have been shown to surpass the Correlational Statistical Query (CSQ) lower bound and achieve near information-theoretically optimal rate in previous works. Our analysis reveals the crucial role of the nonlinear gating mechanism in Mamba for feature extraction, highlighting it as the fundamental driver behind Mamba's ability to achieve both computational efficiency and high performance.
comment: 34 pages
♻ ☆ Y-shaped Generative Flows
Modern continuous-time generative models often induce V-shaped transport: each sample travels independently along nearly straight trajectories from prior to data, overlooking shared structure. We introduce Y-shaped generative flows, which move probability mass together along shared pathways before branching to target-specific endpoints. Our formulation is based on novel velocity-powered objective with a sublinear exponent (between zero and one). this concave dependence rewards joint and fast mass movement. Practically, we instantiate the idea in a scalable neural ODE training objective. On synthetic, image, and biology datasets, Y-flows recover hierarchy-aware structure, improve distributional metrics over strong flow-based baselines, and reach targets with fewer integration steps.
♻ ☆ h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to 2.06x. It also transfers significantly to diverse out-of-distribution ReasoningGym domains and long-context benchmarks, indicating broader generalization. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. h1 therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.
comment: Preprint, 31 pages, 8 figures, long-horizon reasoning
♻ ☆ Learning Inter-Atomic Potentials without Explicit Equivariance
Accurate and scalable machine-learned inter-atomic potentials (MLIPs) are essential for molecular simulations ranging from drug discovery to new material design. Current state-of-the-art models enforce roto-translational symmetries through equivariant neural network architectures, a hard-wired inductive bias that can often lead to reduced flexibility, computational efficiency, and scalability. In this work, we introduce TransIP: Transformer-based Inter-Atomic Potentials, a novel training paradigm for interatomic potentials achieving symmetry compliance without explicit architectural constraints. Our approach guides a generic non-equivariant Transformer-based model to learn SO(3)-equivariance by optimizing its representations in the embedding space. Trained on the recent Open Molecules (OMol25) collection, a large and diverse molecular dataset built specifically for MLIPs and covering different types of molecules (including small organics, biomolecular fragments, and electrolyte-like species), TransIP effectively learns symmetry in its latent space, providing low equivariance error. Further, compared to a data augmentation baseline, TransIP achieves 40% to 60% improvement in performance across varying OMol25 dataset sizes. More broadly, our work shows that learned equivariance can be a powerful and efficient alternative to augmentation-based MLIP models.
comment: 19 pages, 3 tables, 10 figures. Under review. Changes from v1 to v2: Clarified concluding phrases in the abstract and introduction, and corrected a single typo in Table 1's total energy MAE reported for eSEN-sm-d
♻ ☆ Flattening Hierarchies with Policy Bootstrapping NeurIPS 2025
Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
comment: NeurIPS 2025 (Spotlight, top 3.2%)
♻ ☆ Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study
Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around this approach, including its effectiveness compared to vanilla translation, the importance of filtering noisy outputs, and the benefits of mixing translated samples with original English data during alignment. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Translation (GCP) and Llama-3.1-405B. The results highlight the promise of selective translation as a practical and effective method for improving multilingual alignment in LLMs.
♻ ☆ Steady Continuous Monitoring is (Just Barely) Impossible for Tests of Unbounded Length
AB testing evaluates the difference between a control and a treatment in a statistically rigorous manner. Continuous monitoring allows statistical evaluation of an AB test as it proceeds. One goal of continuous monitoring is early stopping -- confirming a statistically significant difference between control and treatment as soon as possible. Another goal is to maintain some statistical capability to discover significant differences later in the test if they cannot be confirmed earlier. These goals are in conflict -- looser requirements for early stopping leave us with more stringent ones for later. This paper shows that it is impossible to maintain a constant requirement for significance for tests that have no a priori stopping time, but we can come arbitrarily close to that goal by using tests that require repeated significant results to con rm statistically significant differences between treatment and control.
♻ ☆ Egocentric Visual Navigation through Hippocampal Sequences
Sequential activation of place-tuned neurons in an animal during navigation is typically interpreted as reflecting the sequence of input from adjacent positions along the trajectory. More recent theories about such place cells suggest sequences arise from abstract cognitive objectives like planning. Here, we propose a mechanistic and parsimonious interpretation to complement these ideas: hippocampal sequences arise from intrinsic recurrent circuitry that propagates activity without readily available input, acting as a temporal memory buffer for extremely sparse inputs. We implement a minimal sequence generator inspired by neurobiology and pair it with an actor-critic learner for egocentric visual navigation. Our agent reliably solves a continuous maze without explicit geometric cues, with performance depending on the length of the recurrent sequence. Crucially, the model outperforms LSTM cores under sparse input conditions (16 channels, ~2.5% activity), but not under dense input, revealing a strong interaction between representational sparsity and memory architecture. In contrast to LSTM agents, hidden sequence units develop localized place fields, distance-dependent spatial kernels, and task-dependent remapping, while inputs orthogonalize and spatial information increases across layers. These phenomena align with neurobiological data and are causal to performance. Together, our results show that sparse input synergizes with sequence-generating dynamics, providing both a mechanistic account of place cell sequences in the mammalian hippocampus and a simple inductive bias for reinforcement learning based on sparse egocentric inputs in navigation tasks.
comment: 20 pages, 21 figures. This is a conference submission
♻ ☆ Learning with Incomplete Context: Linear Contextual Bandits with Pretrained Imputation
The rise of large-scale pretrained models has made it feasible to generate predictive or synthetic features at low cost, raising the question of how to incorporate such surrogate predictions into downstream decision-making. We study this problem in the setting of online linear contextual bandits, where contexts may be complex, nonstationary, and only partially observed. In addition to bandit data, we assume access to an auxiliary dataset containing fully observed contexts--common in practice since such data are collected without adaptive interventions. We propose PULSE-UCB, an algorithm that leverages pretrained models trained on the auxiliary data to impute missing features during online decision-making. We establish regret guarantees that decompose into a standard bandit term plus an additional component reflecting pretrained model quality. In the i.i.d. context case with H\"older-smooth missing features, PULSE-UCB achieves near-optimal performance, supported by matching lower bounds. Our results quantify how uncertainty in predicted contexts affects decision quality and how much historical data is needed to improve downstream learning.
♻ ☆ Teaching Models to Understand (but not Generate) High-risk Data
Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
♻ ☆ ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning NeurIPS 2025
Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) Channel-Aware Decoder, a lightweight decoder utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI. Our code is publicly available at https://github.com/chaudatascience/cha_mae_vit.
comment: Accepted to NeurIPS 2025
♻ ☆ Thinking in Groups: Permutation Tests Reveal Near-Out-of-Distribution
Deep neural networks (DNNs) have the potential to power many biomedical workflows, but training them on truly representative, IID datasets is often infeasible. Most models instead rely on biased or incomplete data, making them prone to out-of-distribution (OoD) inputs that closely resemble in-distribution samples. Such near-OoD cases are harder to detect than standard OOD benchmarks and can cause unreliable, even catastrophic, predictions. Biomedical assays, however, offer a unique opportunity: they often generate multiple correlated measurements per specimen through biological or technical replicates. Exploiting this insight, we introduce Homogeneous OoD (HOoD), a novel OoD detection framework for correlated data. HOoD projects groups of correlated measurements through a trained model and uses permutation-based hypothesis tests to compare them with known subpopulations. Each test yields an interpretable p-value, quantifying how well a group matches a subpopulation. By aggregating these p-values, HOoD reliably identifies OoD groups. In evaluations, HOoD consistently outperforms point-wise and ensemble-based OoD detectors, demonstrating its promise for robust real-world deployment.
♻ ☆ Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis
Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
♻ ☆ Challenges and proposed solutions in modeling multimodal data: A systematic review
Multimodal data modeling has emerged as a powerful approach in clinical research, enabling the integration of diverse data types such as imaging, genomics, wearable sensors, and electronic health records. Despite its potential to improve diagnostic accuracy and support personalized care, modeling such heterogeneous data presents significant technical challenges. This systematic review synthesizes findings from 69 studies to identify common obstacles, including missing modalities, limited sample sizes, dimensionality imbalance, interpretability issues, and finding the optimal fusion techniques. We highlight recent methodological advances, such as transfer learning, generative models, attention mechanisms, and neural architecture search that offer promising solutions. By mapping current trends and innovations, this review provides a comprehensive overview of the field and offers practical insights to guide future research and development in multimodal modeling for medical applications.
♻ ☆ A Personalized Data-Driven Generative Model of Human Repetitive Motion
The deployment of autonomous virtual avatars (in extended reality) and robots in human group activities -- such as rehabilitation therapy, sports, and manufacturing -- is expected to increase as these technologies become more pervasive. Designing cognitive architectures and control strategies to drive these agents requires realistic models of human motion. Furthermore, recent research has shown that each person exhibits a unique velocity signature, highlighting how individual motor behaviors are both rich in variability and internally consistent. However, existing models only provide simplified descriptions of human motor behavior, hindering the development of effective cognitive architectures. In this work, we first show that motion amplitude provides a valid and complementary characterization of individual motor signatures. Then, we propose a fully data-driven approach, based on long short-term memory neural networks, to generate original motion that captures the unique features of specific individuals. We validate the architecture using real human data from participants performing spontaneous oscillatory motion. Extensive analyses show that state-of-the-art Kuramoto-like models fail to replicate individual motor signatures, whereas our model accurately reproduces the velocity distribution and amplitude envelopes of the individual it was trained on, while remaining distinct from others.
comment: 12 pages, 6 figures
♻ ☆ Multi-Variable Batch Bayesian Optimization in Materials Research: Synthetic Data Analysis of Noise Sensitivity and Problem Landscape Effects
Bayesian Optimization (BO) machine learning method is increasingly used to guide experimental optimization tasks in materials science. To emulate the large number of input variables and noise-containing results in experimental materials research, we perform batch BO simulation of six design variables with a range of noise levels. Two test cases relevant for materials science problems are examined: a needle-in-a-haystack case (Ackley function) that may be encountered in, e.g., molecule optimizations, and a smooth landscape with a local optimum in addition to the global optimum (Hartmann function) that may be encountered in, e.g., material composition optimization. We show learning curves, performance metrics, and visualization to effectively track the optimization progression and evaluate how the optimization outcomes are affected by noise, batch-picking method, choice of acquisition function, and exploration hyperparameter values. We find that the effects of noise depend on the problem landscape: noise degrades the optimization results of a needle-in-a-haystack search (Ackley) dramatically more. However, with increasing noise, we observe an increasing probability of landing on the local optimum in Hartmann. Therefore, prior knowledge of the problem domain structure and noise level is essential when designing BO for materials research experiments. Synthetic data studies -- with known ground truth and controlled noise levels -- enable us to isolate and evaluate the impact of different batch BO components, {\it e.g.}, acquisition policy, objective metrics, and hyperparameter values, before transitioning to the inherent uncertainties of real experimental systems. The results and methodology of this study will facilitate a greater utilization of BO in guiding experimental materials research, specifically in settings with a large number of design variables to optimize.
♻ ☆ Quantize What Counts: More for Keys, Less for Values
Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3\% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
♻ ☆ LLM-Enabled In-Context Learning for Data Collection Scheduling in UAV-assisted Sensor Networks
Unmanned Aerial Vehicles (UAVs) are increasingly being utilized in various private and commercial applications, e.g., traffic control, parcel delivery, and Search and Rescue (SAR) missions. Machine Learning (ML) methods used in UAV-Assisted Sensor Networks (UASNETs) and, especially, in Deep Reinforcement Learning (DRL) face challenges such as complex and lengthy model training, gaps between simulation and reality, and low sampling efficiency, which conflict with the urgency of emergencies, such as SAR missions. In this paper, an In-Context Learning (ICL)-Data Collection Scheduling (ICLDC) system is proposed as an alternative to DRL in emergencies. The UAV collects sensory data and transmits it to a Large Language Model (LLM), which creates a task description in natural language. From this description, the UAV receives a data collection schedule that must be executed. A verifier ensures safe UAV operations by evaluating the schedules generated by the LLM and overriding unsafe schedules based on predefined rules. The system continuously adapts by incorporating feedback into the task descriptions and using this for future decisions. This method is tested against jailbreaking attacks, where the task description is manipulated to undermine network performance, highlighting the vulnerability of LLMs to such attacks. The proposed ICLDC significantly reduces cumulative packet loss compared to both the DQN and Maximum Channel Gain baselines. ICLDC presents a promising direction for intelligent scheduling and control in UASNETs.
♻ ☆ Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers
Retrosynthesis is one of the domains transformed by the rise of generative models, and it is one where the problem of nonsensical or erroneous outputs (hallucinations) is particularly insidious: reliable assessment of synthetic plans is time-consuming, with automatic methods lacking. In this work, we present RetroTrim, a retrosynthesis system that successfully avoids nonsensical plans on a set of challenging drug-like targets. Compared to common baselines in the field, our system is not only the sole method that succeeds in filtering out hallucinated reactions, but it also results in the highest number of high-quality paths overall. The key insight behind RetroTrim is the combination of diverse reaction scoring strategies, based on machine learning models and existing chemical databases. We show that our scoring strategies capture different classes of hallucinations by analyzing them on a dataset of labeled retrosynthetic intermediates. This approach formed the basis of our winning solution to the Standard Industries \$1 million Retrosynthesis Challenge. To measure the performance of retrosynthesis systems, we propose a novel evaluation protocol for reactions and synthetic paths based on a structured review by expert chemists. Using this protocol, we compare systems on a set of 32 novel targets, curated to reflect recent trends in drug structures. While the insights behind our methodology are broadly applicable to retrosynthesis, our focus is on targets in the drug-like domain. By releasing our benchmark targets and the details of our evaluation protocol, we hope to inspire further research into reliable retrosynthesis.
♻ ☆ I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2
Large language models demonstrate proficiency on phonetic tasks, such as rhyming, without explicit phonetic or auditory grounding. In this work, we investigate how \verb|Llama-3.2-1B-Instruct| represents token-level phonetic information. Our results suggest that Llama uses a rich internal model of phonemes to complete phonetic tasks. We provide evidence for high-level organization of phoneme representations in its latent space. In doing so, we also identify a ``phoneme mover head" which promotes phonetic information during rhyming tasks. We visualize the output space of this head and find that, while notable differences exist, Llama learns a model of vowels similar to the standard IPA vowel chart for humans, despite receiving no direct supervision to do so.
♻ ☆ The quest for the GRAph Level autoEncoder (GRALE)
Although graph-based learning has attracted a lot of attention, graph representation learning is still a challenging task whose resolution may impact key application fields such as chemistry or biology. To this end, we introduce GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained using an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module, which is trained jointly with the encoder and decoder. The proposed attention-based architecture relies on Evoformer, the core component of AlphaFold, which we extend to support both graph encoding and decoding. We show, in numerical experiments on simulated and molecular data, that GRALE enables a highly general form of pre-training, applicable to a wide range of downstream tasks, from classification and regression to more complex tasks such as graph interpolation, editing, matching, and prediction.
♻ ☆ NEUROLOGIC: From Neural Representations to Interpretable Logic Rules
Rule-based explanation methods offer rigorous and globally interpretable insights into neural network behavior. However, existing approaches are mostly limited to small fully connected networks and depend on costly layerwise rule extraction and substitution processes. These limitations hinder their generalization to more complex architectures such as Transformers. Moreover, existing methods produce shallow, decision-tree-like rules that fail to capture rich, high-level abstractions in complex domains like computer vision and natural language processing. To address these challenges, we propose NEUROLOGIC, a novel framework that extracts interpretable logical rules directly from deep neural networks. Unlike previous methods, NEUROLOGIC can construct logic rules over hidden predicates derived from neural representations at any chosen layer, in contrast to costly layerwise extraction and rewriting. This flexibility enables broader architectural compatibility and improved scalability. Furthermore, NEUROLOGIC supports richer logical constructs and can incorporate human prior knowledge to ground hidden predicates back to the input space, enhancing interpretability. We validate NEUROLOGIC on Transformer-based sentiment analysis, demonstrating its ability to extract meaningful, interpretable logic rules and provide deeper insights-tasks where existing methods struggle to scale.
comment: 16 pages, 9 figures
♻ ☆ Leveraging Nested MLMC for Sequential Neural Posterior Estimation with Intractable Likelihoods
There is a growing interest in studying sequential neural posterior estimation (SNPE) techniques due to their advantages for simulation-based models with intractable likelihoods. The methods aim to learn the posterior from adaptively proposed simulations using neural network-based conditional density estimators. As an SNPE technique, the automatic posterior transformation (APT) method proposed by Greenberg et al. (2019) performs well and scales to high-dimensional data. However, the APT method requires computing the expectation of the logarithm of an intractable normalizing constant, i.e., a nested expectation. Although atomic proposals were used to render an analytical normalizing constant, it remains challenging to analyze the convergence of learning. In this paper, we reformulate APT as a nested estimation problem. Building on this, we construct several multilevel Monte Carlo (MLMC) estimators for the loss function and its gradients to accommodate different scenarios, including two unbiased estimators, and a biased estimator that trades a small bias for reduced variance and controlled runtime and memory usage. We also provide convergence results of stochastic gradient descent to quantify the interaction of the bias and variance of the gradient estimator. Numerical experiments for approximating complex posteriors with multimodality in moderate dimensions are provided to examine the effectiveness of the proposed methods.
comment: 30 pages, 6 figures
♻ ☆ Function regression using the forward forward training and inferring paradigm
Function regression/approximation is a fundamental application of machine learning. Neural networks (NNs) can be easily trained for function regression using a sufficient number of neurons and epochs. The forward-forward learning algorithm is a novel approach for training neural networks without backpropagation, and is well suited for implementation in neuromorphic computing and physical analogs for neural networks. To the best of the authors' knowledge, the Forward Forward paradigm of training and inferencing NNs is currently only restricted to classification tasks. This paper introduces a new methodology for approximating functions (function regression) using the Forward-Forward algorithm. Furthermore, the paper evaluates the developed methodology on univariate and multivariate functions, and provides preliminary studies of extending the proposed Forward-Forward regression to Kolmogorov Arnold Networks, and Deep Physical Neural Networks.
comment: Keywords: Neural Networks, Forward Forward training, Function Regression, Physical Neural Networks, Analog Computing
♻ ☆ Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning
While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally correlated states are properly encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. We formalize this notion by demonstrating how encouraging long-range temporal consistency via successor representations (SR) can facilitate generalization. We then propose a simple yet effective representation learning objective, $\text{BYOL-}\gamma$ for GCBC, which theoretically approximates the successor representation in the finite MDP case through self-predictive representations, and achieves competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.
♻ ☆ FLARE: Fast Low-rank Attention Routing Engine
The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among $N$ tokens by projecting the input sequence onto a fixed length latent sequence of $M \ll N$ tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at $O(NM)$ cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
♻ ☆ Conditional Flow Matching for Bayesian Posterior Inference
We propose a generative multivariate posterior sampler via flow matching. It offers a simple training objective, and does not require access to likelihood evaluation. The method learns a dynamic, block-triangular velocity field in the joint space of data and parameters, which results in a deterministic transport map from a source distribution to the desired posterior. The inverse map, named vector rank, is accessible by reversibly integrating the velocity over time. It is advantageous to leverage the dynamic design: proper constraints on the velocity yield a monotone map, which leads to a conditional Brenier map, enabling a fast and simultaneous generation of Bayesian credible sets whose contours correspond to level sets of Monge-Kantorovich data depth. Our approach is computationally lighter compared to GAN-based and diffusion-based counterparts, and is capable of capturing complex posterior structures. Finally, frequentist theoretical guarantee on the consistency of the recovered posterior distribution, and of the corresponding Bayesian credible sets, is provided.
♻ ☆ Effect Identification and Unit Categorization in the Multi-Score Regression Discontinuity Design with Application to LED Manufacturing
RDD (Regression discontinuity design) is a widely used framework for identifying and estimating causal effects at the cutoff of a single running variable. In practice, however, decision-making often involves multiple thresholds and criteria, especially in production systems. Standard MRD (multi-score RDD) methods address this complexity by reducing the problem to a one-dimensional design. This simplification allows existing approaches to be used to identify and estimate causal effects, but it can introduce non-compliance by misclassifying units relative to the original cutoff rules. We develop theoretical tools to detect and reduce "fuzziness" when estimating the cutoff effect for units that comply with individual subrules of a multi-rule system. In particular, we propose a formal definition and categorization of unit behavior types under multi-dimensional cutoff rules, extending standard classifications of compliers, alwaystakers, and nevertakers, and incorporating defiers and indecisive units. We further identify conditions under which cutoff effects for compliers can be estimated in multiple dimensions, and establish when identification remains valid after excluding nevertakers and alwaystakers. In addition, we examine how decomposing complex Boolean cutoff rules (such as AND- and OR-type rules) into simpler components affects the classification of units into behavioral types and improves estimation by making it possible to identify and remove non-compliant units more accurately. We validate our framework using both semi-synthetic simulations calibrated to production data and real-world data from opto-electronic semiconductor manufacturing. The empirical results demonstrate that our approach has practical value in refining production policies and reduces estimation variance. This underscores the usefulness of the MRD framework in manufacturing contexts.
♻ ☆ Hierarchical Evaluation Function: A Multi-Metric Approach for Optimizing Demand Forecasting Models
Demand forecasting in competitive and uncertain business environments requires models that can integrate multiple evaluation perspectives, rather than being restricted to hyperparameter optimization through a single metric. This traditional approach tends to prioritize one error indicator, which can bias results when metrics provide contradictory signals. In this context, the Hierarchical Evaluation Function (HEF) is proposed as a multi-metric framework for hyperparameter optimization that integrates explanatory power (R2), sensitivity to extreme errors (RMSE), and average accuracy (MAE). The performance of HEF was assessed using four widely recognized benchmark datasets in the forecasting domain: the Walmart, M3, M4, and M5 datasets. Prediction models were optimized through Grid Search, Particle Swarm Optimization (PSO), and Optuna, and statistical analyses based on difference-of-proportions tests confirmed that HEF delivers superior results compared to a unimetric reference function, regardless of the optimizer employed, with particular relevance for heterogeneous monthly time series (M3) and highly granular daily demand scenarios (M5). The findings demonstrate that HEF improves stability, generalization, and robustness at a low computational cost, consolidating its role as a reliable evaluation framework that enhances model selection, enables more accurate demand forecasts, and supports decision-making in dynamic and competitive business environments.
comment: 31 pages, 15 figures, 25 tables. Submitted as a preprint. The manuscript introduces the Hierarchical Evaluation Function, a multi-metric framework for optimizing demand forecasting models under high uncertainty. Includes extensive experimental validation using real-world datasets and a comparative analysis against classical and modern methods
♻ ☆ The Bayesian Approach to Continual Learning: An Overview
Continual learning is an online paradigm where a learner continually accumulates knowledge from different tasks encountered over sequential time steps. Importantly, the learner is required to extend and update its knowledge without forgetting about the learning experience acquired from the past, and while avoiding the need to retrain from scratch. Given its sequential nature and its resemblance to the way humans think, continual learning offers an opportunity to address several challenges which currently stand in the way of widening the range of applicability of deep models to further real-world problems. The continual need to update the learner with data arriving sequentially strikes inherent congruence between continual learning and Bayesian inference which provides a principal platform to keep updating the prior beliefs of a model given new data, without completely forgetting the knowledge acquired from the old data. This survey inspects different settings of Bayesian continual learning, namely task-incremental learning and class-incremental learning. We begin by discussing definitions of continual learning along with its Bayesian setting, as well as the links with related fields, such as domain adaptation, transfer learning and meta-learning. Afterwards, we introduce a taxonomy offering a comprehensive categorization of algorithms belonging to the Bayesian continual learning paradigm. Meanwhile, we analyze the state-of-the-art while zooming in on some of the most prominent Bayesian continual learning algorithms to date. Furthermore, we shed some light on links between continual learning and developmental psychology, and correspondingly introduce analogies between both fields. We follow that with a discussion of current challenges, and finally conclude with potential areas for future research on Bayesian continual learning.
♻ ☆ Variational Reasoning for Language Models
We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
♻ ☆ Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning
Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
♻ ☆ CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose \textbf{C}oordinating \textbf{E}ntropy via \textbf{G}radient-\textbf{P}reserving \textbf{P}olicy \textbf{O}ptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
♻ ☆ Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data
Randomized trials are typically designed to detect average treatment effects but often lack the statistical power to uncover individual-level treatment effect heterogeneity, limiting their value for personalized decision-making. To address this, we propose the QR-learner, a model-agnostic learner that estimates conditional average treatment effects (CATE) within the trial population by leveraging external data from other trials or observational studies. The proposed method is robust: it can reduce the mean squared error relative to a trial-only CATE learner, and is guaranteed to recover the true CATE even when the external data are not aligned with the trial. Moreover, we introduce a procedure that combines the QR-learner with a trial-only CATE learner and show that it asymptotically matches or exceeds both component learners in terms of mean squared error. We examine the performance of our approach in simulation studies and apply the methods to a real-world dataset, demonstrating improvements in both CATE estimation and statistical power for detecting heterogeneous effects.
comment: 22 pages, including references and appendix
♻ ☆ MACTAS: Self-Attention-Based Module for Inter-Agent Communication in Multi-Agent Reinforcement Learning AAMAS 2026
Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi-agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non-differentiable. In this work, we introduce a self-attention-based communication module that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in a reward-driven manner. The module can be seamlessly integrated with any action-value function decomposition method and can be viewed as an extension of such decompositions. Notably, it includes a fixed number of trainable parameters, independent of the number of agents. Experimental results on the SMAC and SMACv2 benchmarks demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on a number of maps.
comment: Submitted for AAMAS 2026
♻ ☆ Optimal Quantization for Matrix Multiplication
Recent work in machine learning community proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goal of these new compression algorithms is to be able to approximate not the matrices themselves, but their matrix product. Specifically, given a pair of real matrices $A,B$ an encoder (compressor) is applied to each of them independently producing descriptions with $R$ bits per entry. These representations subsequently are used by the decoder to estimate matrix product $A^\top B$. In this work, we provide a non-asymptotic lower bound on the mean squared error of this approximation (as a function of rate $R$) for the case of matrices $A,B$ with iid Gaussian entries. Algorithmically, we construct a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices $A$, $B$ in terms of only Frobenius norms $\|\bar{A}\|_F, \|\bar{B}\|_F$ and $\|\bar{A}^\top \bar{B}\|_F$, where $\bar{A},\bar{B}$ are versions of $A,B$ with zero-centered columns, respectively. For iid Gaussian matrices our quantizer achieves the lower bound and is, thus, asymptotically optimal. A practical low-complexity version of our quantizer achieves performance quite close to optimal. In addition, we derive rate-distortion function for matrix multiplication of iid Gaussian matrices, which exhibits an interesting phase-transition at $R\approx 0.906$ bit/entry, showing necessity of Johnson-Lindestrauss dimensionality reduction (sketching) in the low-rate regime.
♻ ☆ Interventional Processes for Causal Uncertainty Quantification
Reliable uncertainty quantification for causal effects is crucial in various applications, but remains difficult in nonparametric models, particularly for continuous treatments. We introduce IMPspec, a Gaussian process (GP) framework for modeling uncertainty over interventional causal functions under continuous treatments, which can be represented using reproducing Kernel Hilbert Spaces (RKHSs). By using principled function class expansions and a spectral representation of RKHS features, IMPspec yields tractable training and inference, a spectral algorithm to calibrate posterior credible intervals, and avoids the underfitting and variance collapse pathologies of earlier GP-on-RKHS methods. Across synthetic benchmarks and an application in healthcare, IMPspec delivers state-of-the-art performance in causal uncertainty quantification and downstream causal Bayesian optimization tasks.
♻ ☆ Do LLM Agents Have Regret? A Case Study in Online Learning and Games ICLR 2025
Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of \emph{regret}. We first empirically study the {no-regret} behaviors of LLMs in canonical (non-stationary) online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To promote the no-regret behaviors, we propose a novel \emph{unsupervised} training loss of \emph{regret-loss}, which, in contrast to the supervised pre-training loss, does not require the labels of (optimal) actions. We then establish the statistical guarantee of generalization bound for regret-loss minimization, followed by the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above ``regrettable'' cases.
comment: Camera ready version of ICLR 2025
♻ ☆ Multi-state Protein Design with DynamicMPNN
Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes - from enzyme catalysis to membrane transport - depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using Alphafold 3, DynamicMPNN outperforms ProteinMPNN by up to 25% on decoy-normalized RMSD and by 12% on sequence recovery across our challenging multi-state protein benchmark.
♻ ☆ Path Gradients after Flow Matching
Boltzmann Generators have emerged as a promising machine learning tool for generating samples from equilibrium distributions of molecular systems using Normalizing Flows and importance weighting. Recently, Flow Matching has helped speed up Continuous Normalizing Flows (CNFs), scale them to more complex molecular systems, and minimize the length of the flow integration trajectories. We investigate the benefits of using path gradients to fine-tune CNFs initially trained by Flow Matching, in the setting where a target energy is known. Our experiments show that this hybrid approach yields up to a threefold increase in sampling efficiency for molecular systems, all while using the same model, a similar computational budget and without the need for additional sampling. Furthermore, by measuring the length of the flow trajectories during fine-tuning, we show that path gradients largely preserve the learned structure of the flow.
♻ ☆ Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization
Human cognitive behavior arises from the interaction of specialized brain networks dedicated to distinct functions, such as language, logic, and social reasoning. Inspired by this organization, we propose Mixture of Cognitive Reasoners (MiCRo): a modular, transformer-based architecture post-trained with a curriculum that induces functional specialization across experts. Concretely, we partition the layers of a pretrained language model into four expert modules aligned with well-studied cognitive networks in the human brain. MiCRo offers three key advantages over standard language models. (1) The specialized experts are interpretable and causally meaningful -- ablating a module causes substantial drops on benchmarks requiring its specialized domain. (2) MiCRo's behavior can be dynamically steered at inference time by routing tokens to particular experts (e.g., favoring social over logical reasoning), enabling fine-grained control over outputs. (3) MiCRo outperforms or matches comparable baselines on both machine-learning reasoning benchmarks (e.g., GSM8K, BBH) and alignment to human behavior (CogBench), while maintaining interpretability. Taken together, cognitively grounded functional specialization yields models that are both more human-like and more human-interpretable.
comment: Preprint. Project Page at https://cognitive-reasoners.epfl.ch
♻ ☆ Accelerating Natural Gradient Descent for PINNs with Randomized Nyström Preconditioning
Natural Gradient Descent (NGD) has emerged as a promising optimization algorithm for training neural network-based solvers for partial differential equations (PDEs), such as Physics-Informed Neural Networks (PINNs). However, its practical use is often limited by the high computational cost of solving linear systems involving the Gramian matrix. While matrix-free NGD methods based on the conjugate gradient (CG) method avoid explicit matrix inversion, the ill-conditioning of the Gramian significantly slows the convergence of CG. In this work, we extend matrix-free NGD to broader classes of problems than previously considered and propose the use of Randomized Nystr\"om preconditioning to accelerate convergence of the inner CG solver. The resulting algorithm demonstrates substantial performance improvements over existing NGD-based methods and other state-of-the-art optimizers on a range of PDE problems discretized using neural networks.
♻ ☆ Leveraging Ensemble-Based Semi-Supervised Learning for Illicit Account Detection in Ethereum DeFi Transactions
The advent of smart contracts has enabled the rapid rise of Decentralized Finance (DeFi) on the Ethereum blockchain, offering substantial rewards in financial innovation and inclusivity. This growth, however, is accompanied by significant security risks such as illicit accounts engaged in fraud. Effective detection is further limited by the scarcity of labeled data and the evolving tactics of malicious accounts. To address these challenges with a robust solution for safeguarding the DeFi ecosystem, we propose $\textbf{SLEID}$, a $\textbf{S}$elf-$\textbf{L}$earning $\textbf{E}$nsemble-based $\textbf{I}$llicit account $\textbf{D}$etection framework. SLEID uses an Isolation Forest model for initial outlier detection and a self-training mechanism to iteratively generate pseudo-labels for unlabeled accounts, enhancing detection accuracy. Experiments on 6,903,860 Ethereum transactions with extensive DeFi interaction coverage demonstrate that SLEID significantly outperforms supervised and semi-supervised baselines with $\textbf{+2.56}$ percentage-point precision, comparable recall, and $\textbf{+0.90}$ percentage-point F1 -- particularly for the minority illicit class -- alongside $\textbf{+3.74}$ percentage-points higher accuracy and improvements in PR-AUC, while substantially reducing reliance on labeled data.
comment: 23 pages, 12 figures
♻ ☆ Assessing the Latent Automated Program Repair Capabilities of Large Language Models using Round-Trip Translation
Research shows that errors in natural language can be corrected by translating texts to another language and back using language models. We explore to what extent this latent correction capability extends to Automated Program Repair (APR) by investigating Round-Trip Translation (RTT): translating code from one programming language into another programming or natural language and back, using Large Language Models (LLMs). We hypothesize that RTT restores patterns most commonly seen in the LLM's training corpora through regression toward the mean, replacing infrequent bugs with more frequent, natural, bug-free code. To test this hypothesis, we employ nine LLMs and four common APR benchmarks in Java, and perform a detailed quantitative and qualitative analysis of RTT-generated patches. We find that RTT through English generates plausible patches for 100 of 164 bugs with GPT-4 on the HumanEval-Java benchmark, and 97 are found to be correct in our manual assessment. Moreover, RTT uniquely generates plausible patches for 46 bugs that were missed by LLMs specifically fine-tuned for APR. While this demonstrates the viability of RTT for APR, we also observe limitations, such as a lower overall bug fix rate than the state-of-the-art and diluting the original coding style. We analyze the impact of these limitations and discuss the potential of using RTT as a complementary component in APR frameworks. A replication package is available for download from https://doi.org/10.5281/zenodo.10500593. Keywords: automated program repair, large language model, machine translation
comment: Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM)
♻ ☆ Intelligent4DSE: Optimizing High-Level Synthesis Design Space Exploration with Graph Neural Networks and Large Language Models
High-Level Synthesis (HLS) Design Space Exploration (DSE) is essential for generating hardware designs that balance performance, power, and area (PPA). To optimize this process, existing works often employs message-passing neural networks (MPNNs) to predict quality of results (QoR). These predictors serve as evaluators in the DSE process, effectively bypassing the time-consuming estimations traditionally required by HLS tools. However, existing models based on MPNNs struggle with over-smoothing and limited expressiveness. Additionally, while meta-heuristic algorithms are widely used in DSE, they typically require extensive domain-specific knowledge to design operators and time-consuming tuning. To address these limitations, we propose ECoGNNs-LLMMHs, a framework that integrates graph neural networks with task-adaptive message passing and large language model-enhanced meta-heuristic algorithms. Compared with state-of-the-art works, ECoGNN exhibits lower prediction error in the post-HLS prediction task, with the error reduced by 57.27\%. For post-implementation prediction tasks, ECoGNN demonstrates the lowest prediction errors, with average reductions of 17.6\% for flip-flop (FF) usage, 33.7\% for critical path (CP) delay, 26.3\% for power consumption, 38.3\% for digital signal processor (DSP) utilization, and 40.8\% for BRAM usage. LLMMH variants can generate superior Pareto fronts compared to meta-heuristic algorithms in terms of average distance from the reference set (ADRS) with average improvements of 87.47\%, respectively. Compared with the SOTA DSE approaches GNN-DSE and IRONMAN-PRO, LLMMH can reduce the ADRS by 68.17\% and 63.07\% respectively.
♻ ☆ Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets
Constructing confidence intervals (CIs) for the average treatment effect (ATE) from patient records is crucial to assess the effectiveness and safety of drugs. However, patient records typically come from different hospitals, thus raising the question of how multiple observational datasets can be effectively combined for this purpose. In our paper, we propose a new method that estimates the ATE from multiple observational datasets and provides valid CIs. Our method makes little assumptions about the observational datasets and is thus widely applicable in medical practice. The key idea of our method is that we leverage prediction-powered inferences and thereby essentially `shrink' the CIs so that we offer more precise uncertainty quantification as compared to na\"ive approaches. We further prove the unbiasedness of our method and the validity of our CIs. We confirm our theoretical results through various numerical experiments. Finally, we provide an extension of our method for constructing CIs from combinations of experimental and observational datasets.
♻ ☆ Normalised clustering accuracy: An asymmetric external cluster validity measure
There is no, nor will there ever be, single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. However, their validity is questionable because the clusterings they endorse can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to fixed ground truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable properties. In particular, they do not identify worst-case scenarios correctly, nor are they easily interpretable. As a consequence, the evaluation of clustering algorithms on diverse benchmark datasets can be difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to some similarity relation, scale-invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
♻ ☆ Clustering with minimum spanning trees: How good can it be?
Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.
♻ ☆ Higher-arity PAC learning, VC dimension and packing lemma
The aim of this note is to overview some of our work in Chernikov, Towsner'20 (arXiv:2010.00726) developing higher arity VC theory (VC$_n$ dimension), including a generalization of Haussler packing lemma, and an associated tame (slice-wise) hypergraph regularity lemma; and to demonstrate that it characterizes higher arity PAC learning (PAC$_n$ learning) in $n$-fold product spaces with respect to product measures introduced by Kobayashi, Kuriyama and Takeuchi'15. We also point out how some of the recent results in arXiv:2402.14294, arXiv:2505.15688, arXiv:2509.20404 follow from our work in arXiv:2010.00726.
comment: v.2. Corrected our presentation of PAC_n learning in the sense of Takeuchi et al. in section 4; and slightly improved the PAC_n learning function in Theorem 6.5 to additionally ensure its properness
♻ ☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
♻ ☆ The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels NeurIPS 2025
Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, we believe that delineating their susceptibility to clean-label poisoning, and developing methods for overcoming this susceptibility, are critical research directions to pursue.
comment: Accepted to NeurIPS 2025
♻ ☆ Exact Gauss-Newton Optimization for Training Deep Neural Networks
We present Exact Gauss-Newton (EGN), a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges in expectation to a stationary point of the objective. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the generalization performance of well-tuned SGD, Adam, GAF, SQN, and SGN optimizers across various supervised and reinforcement learning tasks.
comment: Accepted to Neurocomputing (2025). This version corresponds to the accepted manuscript; minor formatting edits for arXiv single-column layout (34 pages, 3 figures, 8 tables)
♻ ☆ A metrological framework for uncertainty evaluation in machine learning classification models
Machine learning (ML) classification models are increasingly being used in a wide range of applications where it is important that predictions are accompanied by uncertainties, including in climate and earth observation, medical diagnosis and bioaerosol monitoring. The output of an ML classification model is a type of categorical variable known as a nominal property in the International Vocabulary of Metrology (VIM). However, concepts related to uncertainty evaluation for nominal properties are not defined in the VIM, nor is such evaluation addressed by the Guide to the Expression of Uncertainty in Measurement (GUM). In this paper we propose a metrological conceptual uncertainty evaluation framework for nominal properties. This framework is based on probability mass functions and summary statistics thereof, and it is applicable to ML classification. We also illustrate its use in the context of two applications that exemplify the issues and have significant societal impact, namely, climate and earth observation and medical diagnosis. Our framework would enable an extension of the GUM to uncertainty for nominal properties, which would make both applicable to ML classification models.
comment: 51 pages, 5 figures
♻ ☆ Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
To optimize large Transformer model training, both efficient parallel computing and advanced data management are indispensable. However, current methods often assume a stable and uniform training workload, neglecting data-induced imbalances-arising from both sampling and packing processes-which can impede training performance. Specifically, data sampling imbalance arises from uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. For one thing, we introduce large model training with dynamic heterogeneous parallel strategies in response to the sequence length variations within and across training iterations. For another, we devise a two-stage data assignment approach, which strikes a good balance in terms of the training workloads both within and across model replicas. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
comment: SIGMOD 2026
♻ ☆ Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a \emph{unified manner}, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a \emph{safer} region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.
♻ ☆ Fidel-TS: A High-Fidelity Benchmark for Multimodal Time Series Forecasting
The evaluation of time series forecasting models is hindered by a critical lack of high-quality benchmarks, leading to a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the causal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, strict causal soundness, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our extensive experiments validate this approach by exposing the critical biases and design limitations of prior benchmarks. Furthermore, we conclusively demonstrate that the causal relevance of textual information is the key factor in unlocking genuine performance gains in multimodal forecasting.
comment: provide the url of our dataset and code
♻ ☆ Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation
AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
comment: APSIPA ASC 2025 Accepted
♻ ☆ LibEMER: A novel benchmark and algorithms library for EEG-based Multimodal Emotion Recognition
EEG-based multimodal emotion recognition(EMER) has gained significant attention and witnessed notable advancements, the inherent complexity of human neural systems has motivated substantial efforts toward multimodal approaches. However, this field currently suffers from three critical limitations: (i) the absence of open-source implementations. (ii) the lack of standardized and transparent benchmarks for fair performance analysis. (iii) in-depth discussion regarding main challenges and promising research directions is a notable scarcity. To address these challenges, we introduce LibEMER, a unified evaluation framework that provides fully reproducible PyTorch implementations of curated deep learning methods alongside standardized protocols for data preprocessing, model realization, and experimental setups. This framework enables unbiased performance assessment on three widely-used public datasets across two learning tasks. The open-source library is publicly accessible at: https://anonymous.4open.science/r/2025ULUIUBUEUMUEUR485384
comment: 5 pages, 2 figures
♻ ☆ Spatio-Temporal LLM: Reasoning about Environments and Actions
Despite significant recent progress of Multimodal Large Language Models (MLLMs), current MLLMs are challenged by "spatio-temporal" prompts, i.e., prompts that refer to 1) the entirety of an environment encoded in a point cloud that the MLLM should consider; and simultaneously also refer to 2) actions that happened in part of the environment and are encoded in a short ego-centric video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this challenge, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent MLLMs indeed struggle to correctly answer "spatio-temporal" prompts. Building on this dataset, we study two spatio-temporal LLM (STLLM) baselines: 1) STLLM-3D, which directly fuses point cloud, video, and text representations as inputs to the LLM; and 2) STLLM-Aligner, which aligns spatial context with video and text before LLM decoding. Both baselines aim to enhance spatial understanding of environments and temporal grounding of egocentric observations. On REA, the STLLM baselines outperform existing models, demonstrating the effectiveness of our designs. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
comment: Code and data are available at https://zoezheng126.github.io/STLLM-website/
♻ ☆ DynaSearcher: Dynamic Knowledge Graph Augmented Search Agent via Multi-Reward Reinforcement Learning
Multi-step agentic retrieval systems based on large language models (LLMs) have demonstrated remarkable performance in complex information search tasks. However, these systems still face significant challenges in practical applications, particularly in generating factually inconsistent intermediate queries and inefficient search trajectories, which can lead to reasoning deviations or redundant computations. To address these issues, we propose DynaSearcher, an innovative search agent enhanced by dynamic knowledge graphs and multi-reward reinforcement learning (RL). Specifically, our system leverages knowledge graphs as external structured knowledge to guide the search process by explicitly modeling entity relationships, thereby ensuring factual consistency in intermediate queries and mitigating biases from irrelevant information. Furthermore, we employ a multi-reward RL framework for fine-grained control over training objectives such as retrieval accuracy, efficiency, and response quality. This framework promotes the generation of high-quality intermediate queries and comprehensive final answers, while discouraging unnecessary exploration and minimizing information omissions or redundancy. Experimental results demonstrate that our approach achieves state-of-the-art answer accuracy on six multi-hop question answering datasets, matching frontier LLMs while using only small-scale models and limited computational resources. Furthermore, our approach demonstrates strong generalization and robustness across diverse retrieval environments and larger-scale models, highlighting its broad applicability.
comment: 12 pages, 3 figures, 8 tables
♻ ☆ Improving Transferability of Adversarial Examples via Bayesian Attacks
The transferability of adversarial examples allows for the attack on unknown deep neural networks (DNNs), posing a serious threat to many applications and attracting great attention. In this paper, we improve the transferability of adversarial examples by incorporating the Bayesian formulation into both the model parameters and model input, enabling their joint diversification. We demonstrate that combination of Bayesian formulations for both the model input and model parameters yields significant improvements in transferability. By introducing advanced approximations of the posterior distribution over the model input, adversarial transferability achieves further enhancement, surpassing all state-of-the-arts when attacking without model fine-tuning. Additionally, we propose a principled approach to fine-tune model parameters within this Bayesian framework. Extensive experiments demonstrate that our method achieves a new state-of-the-art in transfer-based attacks, significantly improving the average success rate on ImageNet and CIFAR-10. Code at: https://github.com/qizhangli/MoreBayesian-jrnl.
comment: Accepted by TCSVT
♻ ☆ Are High-Degree Representations Really Unnecessary in Equivariant Graph Neural Networks?
Equivariant Graph Neural Networks (GNNs) that incorporate E(3) symmetry have achieved significant success in various scientific applications. As one of the most successful models, EGNN leverages a simple scalarization technique to perform equivariant message passing over only Cartesian vectors (i.e., 1st-degree steerable vectors), enjoying greater efficiency and efficacy compared to equivariant GNNs using higher-degree steerable vectors. This success suggests that higher-degree representations might be unnecessary. In this paper, we disprove this hypothesis by exploring the expressivity of equivariant GNNs on symmetric structures, including $k$-fold rotations and regular polyhedra. We theoretically demonstrate that equivariant GNNs will always degenerate to a zero function if the degree of the output representations is fixed to 1 or other specific values. Based on this theoretical insight, we propose HEGNN, a high-degree version of EGNN to increase the expressivity by incorporating high-degree steerable vectors while maintaining EGNN's efficiency through the scalarization trick. Our extensive experiments demonstrate that HEGNN not only aligns with our theoretical analyses on toy datasets consisting of symmetric structures, but also shows substantial improvements on more complicated datasets such as $N$-body and MD17. Our theoretical findings and empirical results potentially open up new possibilities for the research of equivariant GNNs.
♻ ☆ Benchmarking is Broken -- Don't Let AI be its Own Judge NeurIPS 2025
The meteoric rise of AI, with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench (with its prototype implementation at https://www.peerbench.ai/), a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.
comment: 14 pages; Accepted to NeurIPS 2025. Link to poster: https://neurips.cc/virtual/2025/poster/121919; Link to project website: https://www.peerbench.ai/
♻ ☆ Semantically Guided Action Anticipation
Unsupervised domain adaptation remains a critical challenge in enabling the knowledge transfer of models across unseen domains. Existing methods struggle to balance the need for domain-invariant representations with preserving domain-specific features, which is often due to alignment approaches that impose the projection of samples with similar semantics close in the latent space despite their drastic domain differences. We introduce a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. Our method defines a domain-agnostic structure upon the semantic/geometric relationships between class labels in language space and guides adaptation, ensuring that the organization of samples in visual space reflects reference inter-class relationships while preserving domain-specific characteristics. We empirically demonstrate our method's superiority in domain adaptation tasks across four diverse image and video datasets. Remarkably, we surpass previous works in 18 different adaptation scenarios across four diverse image and video datasets with average accuracy improvements of +3.32% on DomainNet, +5.75% in GeoPlaces, +4.77% on GeoImnet, and +1.94% mean class accuracy improvement on EgoExo4D.
♻ ☆ CSI-BERT2: A BERT-inspired Framework for Efficient CSI Prediction and Classification in Wireless Communication and Sensing
Channel state information (CSI) is a fundamental component in both wireless communication and sensing systems, enabling critical functions such as radio resource optimization and environmental perception. In wireless sensing, data scarcity and packet loss hinder efficient model training, while in wireless communication, high-dimensional CSI matrices and short coherent times caused by high mobility present challenges in CSI estimation.To address these issues, we propose a unified framework named CSI-BERT2 for CSI prediction and classification tasks, built on CSI-BERT, which adapts BERT to capture the complex relationships among CSI sequences through a bidirectional self-attention mechanism. We introduce a two-stage training method that first uses a mask language model (MLM) to enable the model to learn general feature extraction from scarce datasets in an unsupervised manner, followed by fine-tuning for specific downstream tasks. Specifically, we extend MLM into a mask prediction model (MPM), which efficiently addresses the CSI prediction task. To further enhance the representation capacity of CSI data, we modify the structure of the original CSI-BERT. We introduce an adaptive re-weighting layer (ARL) to enhance subcarrier representation and a multi-layer perceptron (MLP)-based temporal embedding module to mitigate temporal information loss problem inherent in the original Transformer.Extensive experiments on both real-world collected and simulated datasets demonstrate that CSI-BERT2 achieves state-of-the-art performance across all tasks. Our results further show that CSI-BERT2 generalizes effectively across varying sampling rates and robustly handles discontinuous CSI sequences caused by packet loss-challenges that conventional methods fail to address. The dataset and code are publicly available at https://github.com/RS2002/CSI-BERT2 .
♻ ☆ Enabling Few-Shot Alzheimer's Disease Diagnosis on Biomarker Data with Tabular LLMs
Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.
comment: accepted by ACM-BCB'25: ACM Conference on Bioinformatics, Computational Biology, and Health Informatics [ACM SIGBio Best Paper Award]
♻ ☆ $Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization NeurIPS2025
Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named {\Delta}Energy. {\Delta}Energy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, {\Delta}Energy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for {\Delta}Energy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yields a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
comment: Accepted by NeurIPS2025
♻ ☆ Latent-Domain Predictive Neural Speech Coding
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind features with a convolutional neural network for encoding, by which there are still temporal redundancies within encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions with rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques. Code and models are available at https://github.com/microsoft/TF-Codec.
comment: Accepted by IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING (TASLP). Code and models are available at https://github.com/microsoft/TF-Codec
♻ ☆ When Less Is More: Binary Feedback Can Outperform Ordinal Comparisons in Ranking Recovery
Paired comparison data, where users evaluate items in pairs, play a central role in ranking and preference learning tasks. While ordinal comparison data intuitively offer richer information than binary comparisons, this paper challenges that conventional wisdom. We propose a general parametric framework for modeling ordinal paired comparisons without ties. The model adopts a generalized additive structure, featuring a link function that quantifies the preference difference between two items and a pattern function that governs the distribution over ordinal response levels. This framework encompasses classical binary comparison models as special cases, by treating binary responses as binarized versions of ordinal data. Within this framework, we show that binarizing ordinal data can significantly improve the accuracy of ranking recovery. Specifically, we prove that under the counting algorithm, the ranking error associated with binary comparisons exhibits a faster exponential convergence rate than that of ordinal data. Furthermore, we characterize a substantial performance gap between binary and ordinal data in terms of a signal-to-noise ratio (SNR) determined by the pattern function. We identify the pattern function that minimizes the SNR and maximizes the benefit of binarization. Extensive simulations and a real application on the MovieLens dataset further corroborate our theoretical findings.
♻ ☆ Studying the Korean Word-Chain Game with RLVR: Mitigating Reward Conflicts via Curriculum Learning
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training large language models (LLMs) with stronger reasoning abilities. It has also been applied to a variety of logic puzzles. In this work, we study the Korean word-chain game using RLVR. We show that rule-derived rewards can naturally conflict, and demonstrate through experiments that a curriculum-learning scheme mitigates these conflicts. Our findings motivate further studies of puzzle tasks in diverse languages.
comment: 10 pages
♻ ☆ Statistical post-processing yields accurate probabilistic forecasts from Artificial Intelligence weather models
Artificial Intelligence (AI) weather models are now reaching operational-grade performance for some variables, but like traditional Numerical Weather Prediction (NWP) models, they exhibit systematic biases and reliability issues. We test the application of the Bureau of Meteorology's existing statistical post-processing system, IMPROVER, to ECMWF's deterministic Artificial Intelligence Forecasting System (AIFS), and compare results against post-processed outputs from the ECMWF HRES and ENS models. Without any modification to processing workflows, post-processing yields comparable accuracy improvements for AIFS as for traditional NWP forecasts, in both expected value and probabilistic outputs. We show that blending AIFS with NWP models improves overall forecast skill, even when AIFS alone is not the most accurate component. These findings show that statistical post-processing methods developed for NWP are directly applicable to AI models, enabling national meteorological centres to incorporate AI forecasts into existing workflows in a low-risk, incremental fashion.
♻ ☆ Learning Power Flow with Confidence: A Probabilistic Guarantee Framework for Voltage Risk
The absence of formal performance guarantees in machine learning (ML) has limited its adoption for safety-critical power system applications, where confidence and interpretability are as vital as accuracy. In this work, we present a probabilistic guarantee for power flow learning and voltage risk estimation, derived through the framework of Gaussian Process (GP) regression. Specifically, we establish a bound on the expected estimation error that connects the GP's predictive variance to confidence in voltage risk estimates, ensuring statistical equivalence with Monte Carlo-based ACPF risk quantification. To enhance model learnability in the low-data regime, we first design the Vertex-Degree Kernel (VDK), a topology-aware additive kernel that decomposes voltage-load interactions into local neighborhoods for efficient large-scale learning. Building on this, we introduce a network-swipe active learning (AL) algorithm that adaptively samples informative operating points and provides a principled stopping criterion without requiring out-of-sample validation. Together, these developments mitigate the principal bottleneck of ML-based power flow-its lack of guaranteed reliability-by combining data efficiency with analytical assurance. Empirical evaluations across IEEE 118-, 500-, and 1354-bus systems confirm that the proposed VDK-GP achieves mean absolute voltage errors below 1E-03 p.u., reproduces Monte Carlo-level voltage risk estimates with 15x fewer ACPF computations, and achieves over 120x reduction in evaluation time while conservatively bounding violation probabilities.
comment: 10 pages
♻ ☆ Random Scaling for Emergent Capabilities
Language models famously improve under a smooth scaling law, but some specific capabilities exhibit sudden breakthroughs in performance. While advocates of "emergence" view breakthroughs as unlocked capabilities, others attribute them to thresholding effects on noncontinuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. In a case study of inverse scaling, we show that even as the probability of a successful run declines, the average performance of a successful run increases monotonically. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LM populations. Our observations hold true even under continuous loss metrics, confirming that random variation must be considered when predicting a model's performance from its scale.
comment: 20 pages
♻ ☆ Probabilistic QoS Metric Forecasting in Delay-Tolerant Networks Using Conditional Diffusion Models on Latent Dynamics
Active QoS metric prediction, commonly employed in the maintenance and operation of DTN, could enhance network performance regarding latency, throughput, energy consumption, and dependability. Naturally formulated as a multivariate time series forecasting problem, it attracts substantial research efforts. Traditional mean regression methods for time series forecasting cannot capture the data complexity adequately, resulting in deteriorated performance in operational tasks in DTNs such as routing. This paper formulates the prediction of QoS metrics in DTN as a probabilistic forecasting problem on multivariate time series, where one could quantify the uncertainty of forecasts by characterizing the distribution of these samples. The proposed approach hires diffusion models and incorporates the latent temporal dynamics of non-stationary and multi-mode data into them. Extensive experiments demonstrate the efficacy of the proposed approach by showing that it outperforms the popular probabilistic time series forecasting methods.
♻ ☆ A Comprehensive Survey on Data Augmentation
Data augmentation is a series of techniques that generate high-quality artificial data by manipulating existing data samples. By leveraging data augmentation techniques, AI models can achieve significantly improved applicability in tasks involving scarce or imbalanced datasets, thereby substantially enhancing AI models' generalization capabilities. Existing literature surveys only focus on a certain type of specific modality data and categorize these methods from modality-specific and operation-centric perspectives, which lacks a consistent summary of data augmentation methods across multiple modalities and limits the comprehension of how existing data samples serve the data augmentation process. To bridge this gap, this survey proposes a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities by investigating how to take advantage of the intrinsic relationship between and within instances. Additionally, it categorizes data augmentation methods across five data modalities through a unified inductive approach.
comment: 20 pages; accepted to IEEE TKDE
♻ ☆ Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization
Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) yet often fails to maintain comparable performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Although global pruning aims to identify an optimal sparse model, intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose T\'yr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that T\'yr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.
♻ ☆ Dynamic Rank Adjustment for Accurate and Efficient Neural Network Training
Low-rank training methods reduce the number of trainable parameters by re-parameterizing the weights with matrix decompositions (e.g., singular value decomposition). However, enforcing a fixed low-rank structure caps the rank of the weight matrices and can hinder the model's ability to learn complex patterns. Furthermore, the effective rank of the model's weights tends to decline during training, and this drop is accelerated when the model is reparameterized into a low-rank structure. In this study, we argue that strategically interleaving full-rank training epochs within low-rank training epochs can effectively restore the rank of the model's weights. Based on our findings, we propose a general dynamic-rank training framework that is readily applicable to a wide range of neural-network tasks. We first describe how to adjust the rank of weight matrix to alleviate the inevitable rank collapse that arises during training, and then present extensive empirical results that validate our claims and demonstrate the efficacy of the proposed framework. Our empirical study shows that the proposed method achieves almost the same computational cost as SVD-based low-rank training while achieving a comparable accuracy to full-rank training across various benchmarks.
♻ ☆ Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs NeurIPS2025
Large language models (LLMs) excel at complex tasks thanks to advances in their reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs to make the models achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward, i.e., quantifies the episode-wise information gain in parameters, requiring no extra annotations or task-specific evaluators. We propose a method to quickly estimate this reward based on PAC-Bayes bounds and the Fisher information matrix. Theoretical analyses show that it significantly reduces computational complexity with high estimation accuracy. By immediately rewarding each episode's contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning to maximize the use of each episode and achieve effective updates. Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.
comment: Accepted by NeurIPS2025
♻ ☆ Efficient Prediction of SO(3)-Equivariant Hamiltonian Matrices via SO(2) Local Frames
We consider the task of predicting Hamiltonian matrices to accelerate electronic structure calculations, which plays an important role in physics, chemistry, and materials science. Motivated by the inherent relationship between the off-diagonal blocks of the Hamiltonian matrix and the SO(2) local frame, we propose a novel and efficient network, called QHNetV2, that achieves global SO(3) equivariance without the costly SO(3) Clebsch-Gordan tensor products. This is achieved by introducing a set of new efficient and powerful SO(2)-equivariant operations and performing all off-diagonal feature updates and message passing within SO(2) local frames, thereby eliminating the need of SO(3) tensor products. Moreover, a continuous SO(2) tensor product is performed within the SO(2) local frame at each node to fuse node features, mimicking the symmetric contraction operation. Extensive experiments on the large QH9 and MD17 datasets demonstrate that our model achieves superior performance across a wide range of molecular structures and trajectories, highlighting its strong generalization capability. The proposed SO(2) operations on SO(2) local frames offer a promising direction for scalable and symmetry-aware learning of electronic structures. Our code will be released as part of the AIRS library https://github.com/divelab/AIRS.
comment: Code available at: https://github.com/divelab/AIRS
♻ ☆ Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs NeurIPS 2020
We study the sequential decision making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is~dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.
comment: 76 pages, 4 figures, 2 tables; Journal version of the NeurIPS 2020 paper; Accepted to JMLR
♻ ☆ Self-Evolving LLMs via Continual Instruction Tuning
In real-world industrial settings, large language models (LLMs) must learn continually to keep pace with diverse and evolving tasks, requiring self-evolution to refine knowledge under dynamic data distributions. However, existing continual learning (CL) approaches, such as replay and parameter isolation, often suffer from catastrophic forgetting: training on new tasks degrades performance on earlier ones by overfitting to the new distribution and weakening generalization.We propose MoE-CL, a parameter-efficient adversarial mixture-of-experts framework for industrial-scale, self-evolving continual instruction tuning of LLMs. MoE-CL uses a dual-expert design: (1) a dedicated LoRA expert per task to preserve task-specific knowledge via parameter independence, mitigating forgetting; and (2) a shared LoRA expert to enable cross-task transfer. To prevent transferring task-irrelevant noise through the shared pathway, we integrate a task-aware discriminator within a GAN. The discriminator encourages the shared expert to pass only task-aligned information during sequential training. Through adversarial learning, the shared expert acquires generalized representations that mimic the discriminator, while dedicated experts retain task-specific details, balancing knowledge retention and cross-task generalization and thereby supporting self-evolution.Extensive experiments on the public MTL5 benchmark and an industrial Tencent3 benchmark validate the effectiveness of MoE-CL for continual instruction tuning. In real-world A/B testing for content compliance review on the Tencent Video platform, MoE-CL reduced manual review costs by 15.3%. These results demonstrate that MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical.
♻ ☆ SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing
Low-latency symbolic music generation is essential for real-time improvisation and human-AI co-creation. Existing transformer-based models, however, face a trade-off between inference speed and musical quality. Traditional acceleration techniques such as embedding pooling significantly degrade quality, while recently proposed Byte Pair Encoding (BPE) methods - though effective on single-track piano data - suffer large performance drops in multi-track settings, as revealed by our analysis. We propose Attribute-Specialized Key-Value Head Sharing (AS-KVHS), adapted to music's structured symbolic representation, achieving about 30% inference speedup with only a negligible (about 0.4%) quality drop in objective evaluations and slight improvements in subjective listening tests. Our main contributions are (1) the first systematic study of BPE's generalizability in multi-track symbolic music, and (2) the introduction of AS-KVHS for low-latency symbolic music generation. Beyond these, we also release SAGE-Music, an open-source benchmark that matches or surpasses state-of-the-art models in generation quality.
comment: Withdrawn after identifying that results in Section 5 require additional re-analysis before public dissemination
♻ ☆ A Brain-to-Population Graph Learning Framework for Diagnosing Brain Disorders
Recent developed graph-based methods for diagnosing brain disorders using functional connectivity highly rely on predefined brain atlases, but overlook the rich information embedded within atlases and the confounding effects of site and phenotype variability. To address these challenges, we propose a two-stage Brain-to-Population Graph Learning (B2P-GL) framework that integrates the semantic similarity of brain regions and condition-based population graph modeling. In the first stage, termed brain representation learning, we leverage brain atlas knowledge from GPT-4 to enrich the graph representation and refine the brain graph through an adaptive node reassignment graph attention network. In the second stage, termed population disorder diagnosis, phenotypic data is incorporated into population graph construction and feature fusion to mitigate confounding effects and enhance diagnosis performance. Experiments on the ABIDE I, ADHD-200, and Rest-meta-MDD datasets show that B2P-GL outperforms state-of-the-art methods in prediction accuracy while enhancing interpretability. Overall, our proposed framework offers a reliable and personalized approach to brain disorder diagnosis, advancing clinical applicability.
comment: this paper has been submitted for possible publication
♻ ☆ Towards A Universally Transferable Acceleration Method for Density Functional Theory
Recently, sophisticated deep learning-based approaches have been developed for generating efficient initial guesses to accelerate the convergence of density functional theory (DFT) calculations. While the actual initial guesses are often density matrices (DM), quantities that can convert into density matrices also qualify as alternative forms of initial guesses. Hence, existing works mostly rely on the prediction of the Hamiltonian matrix for obtaining high-quality initial guesses. However, the Hamiltonian matrix is both numerically difficult to predict and intrinsically non-transferable, hindering the application of such models in real scenarios. In light of this, we propose a method that constructs DFT initial guesses by predicting the electron density in a compact auxiliary basis representation using E(3)-equivariant neural networks. Trained on small molecules with up to 20 atoms, our model is able to achieve an average 33.3% self-consistent field (SCF) step reduction on systems up to 60 atoms, substantially outperforming Hamiltonian-centric and DM-centric models. Critically, this acceleration remains nearly constant with increasing system sizes and exhibits strong transferring behaviors across orbital basis sets and exchange-correlation (XC) functionals. To the best of our knowledge, this work represents the first and robust candidate for a universally transferable DFT acceleration method. We are also releasing the SCFbench dataset and its accompanying code to facilitate future research in this promising direction.
♻ ☆ Non-convex entropic mean-field optimization via Best Response flow NeurIPS 2025
We study the problem of minimizing non-convex functionals on the space of probability measures, regularized by the relative entropy (KL divergence) with respect to a fixed reference measure, as well as the corresponding problem of solving entropy-regularized non-convex-non-concave min-max problems. We utilize the Best Response flow (also known in the literature as the fictitious play flow) and study how its convergence is influenced by the relation between the degree of non-convexity of the functional under consideration, the regularization parameter and the tail behaviour of the reference measure. In particular, we demonstrate how to choose the regularizer, given the non-convex functional, so that the Best Response operator becomes a contraction with respect to the $L^1$-Wasserstein distance, which ensures the existence of its unique fixed point that is then shown to be the unique global minimizer for our optimization problem. This extends recent results where the Best Response flow was applied to solve convex optimization problems regularized by the relative entropy with respect to arbitrary reference measures, and with arbitrary values of the regularization parameter. Our results explain precisely how the assumption of convexity can be relaxed, at the expense of making a specific choice of the regularizer. Additionally, we demonstrate how these results can be applied in reinforcement learning in the context of policy optimization for Markov Decision Processes and Markov games with softmax parametrized policies in the mean-field regime.
comment: NeurIPS 2025
♻ ☆ IterMask3D: Unsupervised Anomaly Detection and Segmentation with Test-Time Iterative Mask Refinement in 3D Brain MR
Unsupervised anomaly detection and segmentation methods train a model to learn the training distribution as `normal'. In the testing phase, they identify patterns that deviate from this normal distribution as `anomalies'. To learn the `normal' distribution, prevailing methods corrupt the images and train a model to reconstruct them. During testing, the model attempts to reconstruct corrupted inputs based on the learned `normal' distribution. Deviations from this distribution lead to high reconstruction errors, which indicate potential anomalies. However, corrupting an input image inevitably causes information loss even in normal regions, leading to suboptimal reconstruction and an increased risk of false positives. To alleviate this, we propose $\rm{IterMask3D}$, an iterative spatial mask-refining strategy designed for 3D brain MRI. We iteratively spatially mask areas of the image as corruption and reconstruct them, then shrink the mask based on reconstruction error. This process iteratively unmasks `normal' areas to the model, whose information further guides reconstruction of `normal' patterns under the mask to be reconstructed accurately, reducing false positives. In addition, to achieve better reconstruction performance, we also propose using high-frequency image content as additional structural information to guide the reconstruction of the masked area. Extensive experiments on the detection of both synthetic and real-world imaging artifacts, as well as segmentation of various pathological lesions across multiple MRI sequences, consistently demonstrate the effectiveness of our proposed method. Code is available at https://github.com/ZiyunLiang/IterMask3D.
comment: Published in Medical Image Analysis
♻ ☆ Superior Molecular Representations from Intermediate Encoder Layers
Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we first analyze the information flow in five diverse molecular encoders and find that intermediate layers retain more general-purpose features, whereas the final-layer specializes and compresses information. We then perform an empirical layer-wise evaluation across 22 property prediction tasks. We find that using frozen embeddings from optimal intermediate layers improves downstream performance by an average of 5.4%, up to 28.6%, compared to the final-layer. Furthermore, finetuning encoders truncated at intermediate depths achieves even greater average improvements of 8.5%, with increases as high as 40.8%, obtaining new state-of-the-art results on several benchmarks. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code will be made publicly available.
♻ ☆ TASER: Table Agents for Schema-guided Extraction and Recommendation
Real-world financial documents report essential information about an entity's financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.
♻ ☆ Re-uploading quantum data: A universal function approximator for quantum inputs
Quantum data re-uploading has proved powerful for classical inputs, where repeatedly encoding features into a small circuit yields universal function approximation. Extending this idea to quantum inputs remains underexplored, as the information contained in a quantum state is not directly accessible in classical form. We propose and analyze a quantum data re-uploading architecture in which a qubit interacts sequentially with fresh copies of an arbitrary input state. The circuit can approximate any bounded continuous function using only one ancilla qubit and single-qubit measurements. By alternating entangling unitaries with mid-circuit resets of the input register, the architecture realizes a discrete cascade of completely positive and trace-preserving maps, analogous to collision models in open quantum system dynamics. Our framework provides a qubit-efficient and expressive approach to designing quantum machine learning models that operate directly on quantum data.
comment: 24 pages, 11 figures
♻ ☆ Constrained belief updates explain geometric structures in transformer representations
What computational structures emerge in transformers trained on next-token prediction? In this work, we provide evidence that transformers implement constrained Bayesian belief updating -- a parallelized version of partial Bayesian inference shaped by architectural constraints. We integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models that generate rich geometric patterns in neural activations. Our primary analysis focuses on single-layer transformers, revealing how the first attention layer implements these constrained updates, with extensions to multi-layer architectures demonstrating how subsequent layers refine these representations. We find that attention carries out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail -- including the attention pattern, OV-vectors, and embedding vectors -- by modifying the equations for optimal future token predictions to account for the architectural constraints of attention. Our approach provides a principled lens on how architectural constraints shape the implementation of optimal prediction, revealing why transformers develop specific intermediate geometric structures.
♻ ☆ Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.
comment: Published at EMNLP 2025; Code and data available at https://github.com/gxx27/time_unlearn
♻ ☆ Predictive Preference Learning from Human Interventions NeurIPS 2025
Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent's action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, with the assumption that the agent follows the same action and the human makes the same intervention in the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: https://metadriverse.github.io/ppl
comment: NeurIPS 2025 Spotlight. Project page: https://metadriverse.github.io/ppl
♻ ☆ DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation
Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on https://anonymous.4open.science/w/DiSTAR_demo.
♻ ☆ Confidence Calibration in Large Language Model-Based Entity Matching
This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.
comment: 9 pages, 2 figures. UncertaiNLP 2025 Workshop @ EMNLP Camera Ready
♻ ☆ Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. Building on classic BERTology work, we analyze 25 models spanning from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1), probing layer-by-layer representations across eight linguistic tasks in English. Consistent with earlier findings, we find that hierarchical organization persists in modern models: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena. We dive deeper, conducting an in-depth multilingual analysis of two specific linguistic properties - lexical identity and inflectional morphology - that help disentangle form from meaning. We find that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible throughout all layers. Additional analyses of attention mechanisms, steering vectors, and pretraining checkpoints reveal where this information resides within layers, how it can be functionally manipulated, and how representations evolve during pretraining. Taken together, our findings suggest that, even with substantial advances in LLM technologies, transformer models learn to organize linguistic information in similar ways, regardless of model architecture, size, or training regime, indicating that these properties are important for next token prediction. Our code is available at https://github.com/ml5885/model_internal_sleuthing
♻ ☆ Electrostatics from Laplacian Eigenbasis for Neural Network Interatomic Potentials
In this work, we introduce Phi-Module, a universal plugin module that enforces Poisson's equation within the message-passing framework to learn electrostatic interactions in a self-supervised manner. Specifically, each atom-wise representation is encouraged to satisfy a discretized Poisson's equation, making it possible to acquire a potential {\phi} and corresponding charges \r{ho} linked to the learnable Laplacian eigenbasis coefficients of a given molecular graph. We then derive an electrostatic energy term, crucial for improved total energy predictions. This approach integrates seamlessly into any existing neural potential with insignificant computational overhead. Our results underscore how embedding a first-principles constraint in neural interatomic potentials can significantly improve performance while remaining hyperparameter-friendly, memory-efficient, and lightweight in training. Code will be available at https://github.com/dunnolab/phi-module.
♻ ☆ KL-regularization Itself is Differentially Private in Bandits and RLHF
Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP ``for free''. In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.
♻ ☆ On the Fundamental Impossibility of Hallucination Control in Large Language Models
This paper establishes a fundamental Impossibility Theorem: no LLM performing non-trivial knowledge aggregation can simultaneously achieve truthful knowledge representation, semantic information conservation, complete revelation of relevant knowledge, and knowledge-constrained optimality. This impossibility stems from the mathematical structure of information aggregation, not from engineering limitations. We prove this by modeling inference as an auction of ideas, where distributed components compete to influence responses using their encoded knowledge. The proof employs three independent approaches: mechanism design (Green-Laffont theorem), proper scoring rules (Savage), and transformer architecture analysis (log-sum-exp convexity). We introduce the semantic information measure and the emergence operator to analyze computationally bounded and unbounded reasoning. Bounded reasoning makes latent information accessible, enabling gradual insights and creativity, while unbounded reasoning makes all derivable knowledge immediately accessible while preserving the semantic content. We prove the conservation-reasoning dichotomy: meaningful reasoning necessarily violates information conservation. Our framework suggests that hallucination and imagination are mathematically identical, and both violate at least one of the four essential properties. The Jensen gap in transformer attention quantifies this violation as excess confidence beyond constituent evidence. This unified view explains why capable models must balance truthfulness against creativity. These results provide principled foundations for managing hallucination trade-offs in AI systems. Rather than eliminating hallucination, we should optimize these inevitable trade-offs for specific applications. We conclude with philosophical implications connecting the impossibility to fundamental limits of reason.
comment: Mathematics debugged: added examples, corrected transformer example, re-edited, typos removed
♻ ☆ Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression
Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
♻ ☆ PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.
♻ ☆ TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness
Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods and uncover meaningful intuitions linking specific design choices to forecasting scenarios. Furthermore, we release a practical toolkit within TimeRecipe that recommends suitable model architectures based on these empirical insights. The benchmark is available at: https://github.com/AdityaLab/TimeRecipe.
comment: 47 pages, 1 figure, 30 tables
♻ ☆ MAFA: A multi-agent framework for annotation
Modern consumer banking applications require accurate and efficient retrieval of information in response to user queries. Mapping user utterances to the most relevant Frequently Asked Questions (FAQs) is a crucial component of these systems. Traditional approaches often rely on a single model or technique, which may not capture the nuances of diverse user inquiries. In this paper, we introduce a multi-agent framework for FAQ annotation that combines multiple specialized agents with different approaches and a judge agent that reranks candidates to produce optimal results. Our agents utilize a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries. Our framework features a few-shot example strategy, where each agent receives different few-shots, enhancing ensemble diversity and coverage of the query space. We evaluate our framework on a real-world major bank dataset as well as public benchmark datasets (LCQMC and FiQA), demonstrating significant improvements over single-agent approaches across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on our dataset, and similar gains on public benchmarks when compared with traditional and single-agent annotation techniques. Our framework is particularly effective at handling ambiguous queries, making it well-suited for deployment in production banking applications while showing strong generalization capabilities across different domains and languages.
♻ ☆ Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
♻ ☆ Kernel Neural Operators (KNOs) for Scalable, Memory-efficient, Geometrically-flexible Operator Learning
This paper introduces the Kernel Neural Operator (KNO), a provably convergent operator-learning architecture that utilizes compositions of deep kernel-based integral operators for function-space approximation of operators (maps from functions to functions). The KNO decouples the choice of kernel from the numerical integration scheme (quadrature), thereby naturally allowing for operator learning with explicitly-chosen trainable kernels on irregular geometries. On irregular domains, this allows the KNO to utilize domain-specific quadrature rules. To help ameliorate the curse of dimensionality, we also leverage an efficient dimension-wise factorization algorithm on regular domains. More importantly, the ability to explicitly specify kernels also allows the use of highly expressive, non-stationary, neural anisotropic kernels whose parameters are computed by training neural networks. Numerical results demonstrate that on existing benchmarks the training and test accuracy of KNOs is comparable to or higher than popular operator learning techniques while typically using an order of magnitude fewer trainable parameters, with the more expressive kernels proving important to attaining high accuracy. KNOs thus facilitate low-memory, geometrically-flexible, deep operator learning, while retaining the implementation simplicity and transparency of traditional kernel methods from both scientific computing and machine learning.
comment: 9 pages + 13 page appendix, 5 figures
♻ ☆ PoE-World: Compositional World Modeling with Products of Programmatic Experts
Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.
♻ ☆ Generative AI Meets Future Cities: Towards an Era of Autonomous Urban Intelligence
The two fields of urban planning and artificial intelligence (AI) arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we introduce the importance of urban planning from the sustainability, living, economic, disaster, and environmental perspectives. We review the fundamental concepts of urban planning and relate these concepts to crucial open problems of machine learning, including adversarial learning, generative neural networks, deep encoder-decoder networks, conversational AI, and geospatial and temporal machine learning, thereby assaying how AI can contribute to modern urban planning. Thus, a central problem is automated land-use configuration, which is formulated as the generation of land uses and building configuration for a target area from surrounding geospatial, human mobility, social media, environment, and economic activities. Finally, we delineate some implications of AI for urban planning and propose key research areas at the intersection of both topics.
comment: Submitting to the ACM AI Letters
♻ ☆ Generating High Dimensional User-Specific Wireless Channels using Diffusion Models
Deep neural network (DNN)-based algorithms are emerging as an important tool for many physical and MAC layer functions in future wireless communication systems, including for large multi-antenna channels. However, training such models typically requires a large dataset of high-dimensional channel measurements, which are very difficult and expensive to obtain. This paper introduces a novel method for generating synthetic wireless channel data using diffusion-based models to produce user-specific channels that accurately reflect real-world wireless environments. Our approach employs a conditional denoising diffusion implicit model (cDDIM) framework, effectively capturing the relationship between user location and multi-antenna channel characteristics. We generate synthetic high fidelity channel samples using user positions as conditional inputs, creating larger augmented datasets to overcome measurement scarcity. The utility of this method is demonstrated through its efficacy in training various downstream tasks such as channel compression and beam alignment. Our diffusion-based augmentation approach achieves over a 1-2 dB gain in NMSE for channel compression, and an 11dB SNR boost in beamforming compared to prior methods, such as noise addition or the use of generative adversarial networks (GANs).
♻ ☆ Probabilistic Reasoning with LLMs for k-anonymity Estimation NeurIPS 2025
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text-the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high-variance predictions are 37.47% less accurate on average.
comment: 10 pages, Accepted to NeurIPS 2025
♻ ☆ Uni-LoRA: One Vector is All You Need NeurIPS 2025
Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace R^d, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$.Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a "one-vector-only" solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.
comment: NeurIPS 2025 Spotlight
♻ ☆ Forecasting Generative Amplification
Generative networks are perfect tools to enhance the speed and precision of LHC simulations. It is important to understand their statistical precision, especially when generating events beyond the size of the training dataset. We present two complementary methods to estimate the amplification factor without large holdout datasets. Averaging amplification uses Bayesian networks or ensembling to estimate amplification from the precision of integrals over given phase-space volumes. Differential amplification uses hypothesis testing to quantify amplification without any resolution loss. Applied to state-of-the-art event generators, both methods indicate that amplification is possible in specific regions of phase space, but not yet across the entire distribution.
comment: 23 pages, 15 figures. v2: added link to github repo, extended acknowledgements. v3: updated conventions and refined text, now 25 pages
♻ ☆ Tokenizing Single-Channel EEG with Time-Frequency Motif Learning
Foundation models are reshaping EEG analysis, yet an important problem of EEG tokenization remains a challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. We propose a dual-path architecture with time-frequency masking to capture robust motif representations, and it is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits: Accuracy: Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 17% improvement in Cohen's Kappa over strong baselines. Generalization: Moreover, as a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM. Scalability: By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic. Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%. A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability. Code is available at https://github.com/Jathurshan0330/TFM-Tokenizer.
♻ ☆ BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants and Noise Contrastive Estimation NeurIPS 2025
Efficient sampling from the Boltzmann distribution given its energy function is a key challenge for modeling complex physical systems such as molecules. Boltzmann Generators address this problem by leveraging continuous normalizing flows to transform a simple prior into a distribution that can be reweighted to match the target using sample likelihoods. Despite the elegance of this approach, obtaining these likelihoods requires computing costly Jacobians during integration, which is impractical for large molecular systems. To overcome this difficulty, we train an energy-based model (EBM) to approximate likelihoods using both noise contrastive estimation (NCE) and score matching, which we show outperforms the use of either objective in isolation. On 2d synthetic systems where failure can be easily visualized, NCE improves mode weighting relative to score matching alone. On alanine dipeptide, our method yields free energy profiles and energy distributions that closely match those obtained using exact likelihoods while achieving $100\times$ faster inference. By training on multiple dipeptide systems, we show that our approach also exhibits effective transfer learning, generalizing to new systems at inference time and achieving at least a $6\times$ speedup over standard MD. While many recent efforts in generative modeling have prioritized models with fast sampling, our work demonstrates the design of models with accelerated likelihoods, enabling the application of reweighting schemes that ensure unbiased Boltzmann statistics at scale. Our code is available at https://github.com/RishalAggarwal/BoltzNCE.
comment: 26 pages, 10 figures, NeurIPS 2025
♻ ☆ Integrating feature selection and regression methods with technical indicators for predicting Apple Inc. stock prices
Stock price prediction is influenced by a variety of factors, including technical indicators, which makes Feature selection crucial for identifying the most relevant predictors. This study examines the impact of feature selection on stock price prediction accuracy using technical indicators. A total of 123 technical indicators and 10 regression models were evaluated using 13 years of Apple Inc. data. The primary goal is to identify the best combination of indicators and models for improved forecasting. The results show that a 3-day time window provides the highest prediction accuracy. Model performance was assessed using five error-based metrics. Among the models, Linear Regression and Ridge Regression achieved the best overall performance, each with a Mean Squared Error (MSE) of 0.00025. Applying feature selection significantly improved model accuracy. For example, the Multi-layered Perceptron Regression using Forward Selection improved by 56.47% over its baseline version. Support Vector Regression improved by 67.42%, and Linear Regression showed a 76.7% improvement when combined with Forward Selection. Ridge Regression also demonstrated a 72.82% enhancement. Additionally, Decision Tree, K-Nearest Neighbor, and Random Forest models showed varying levels of improvement when used with Backward Selection. The most effective technical indicators for stock price prediction were found to be Squeeze_pro, Percentage Price Oscillator, Thermo, Decay, Archer On-Balance Volume, Bollinger Bands, Squeeze, and Ichimoku. Overall, the study highlights that combining selected technical indicators with appropriate regression models can significantly enhance the accuracy and efficiency of stock price predictions.
comment: 18 pages, 9 figures,5 tables,45 references
♻ ☆ Improving Intrusion Detection with Domain-Invariant Representation Learning in Latent Space
Zero-day anomaly detection is critical in industrial applications where novel, unforeseen threats can compromise system integrity and safety. Traditional detection systems often fail to identify these unseen anomalies due to their reliance on in-distribution data. Domain generalization addresses this gap by leveraging knowledge from multiple known domains to detect out-of-distribution events. In this work, we introduce a multi-task representation learning technique that fuses information across related domains into a unified latent space. By jointly optimizing classification, reconstruction, and mutual information regularization losses, our method learns a minimal(bottleneck), domain-invariant representation that discards spurious correlations. This latent space decorrelation enhances generalization, enabling the detection of anomalies in unseen domains. Our experimental results demonstrate significant improvements in zero-day or novel anomaly detection across diverse anomaly detection datasets.
♻ ☆ All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
comment: Version 2: updated related works section on LLM steganography
♻ ☆ Evaluating Sakana's AI Scientist: Bold Claims, Mixed Results, and a Promising Future?
A major step toward Artificial General Intelligence (AGI) and Super Intelligence is AI's ability to autonomously conduct research - what we term Artificial Research Intelligence (ARI). If machines could generate hypotheses, conduct experiments, and write research papers without human intervention, it would transform science. Sakana recently introduced the 'AI Scientist', claiming to conduct research autonomously, i.e. they imply to have achieved what we term Artificial Research Intelligence (ARI). The AI Scientist gained much attention, but a thorough independent evaluation has yet to be conducted. Our evaluation of the AI Scientist reveals critical shortcomings. The system's literature reviews produced poor novelty assessments, often misclassifying established concepts (e.g., micro-batching for stochastic gradient descent) as novel. It also struggles with experiment execution: 42% of experiments failed due to coding errors, while others produced flawed or misleading results. Code modifications were minimal, averaging 8% more characters per iteration, suggesting limited adaptability. Generated manuscripts were poorly substantiated, with a median of five citations, most outdated (only five of 34 from 2020 or later). Structural errors were frequent, including missing figures, repeated sections, and placeholder text like 'Conclusions Here'. Some papers contained hallucinated numerical results. Despite these flaws, the AI Scientist represents a leap forward in research automation. It generates full research manuscripts with minimal human input, challenging expectations of AI-driven science. Many reviewers might struggle to distinguish its work from human researchers. While its quality resembles a rushed undergraduate paper, its speed and cost efficiency are unprecedented, producing a full paper for USD 6 to 15 with 3.5 hours of human involvement, far outpacing traditional researchers.
comment: 20 pages
Computer Vision and Pattern Recognition 191
☆ DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.
Detect Anything via Next Point Prediction
Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; \3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
comment: homepage: https://rex-omni.github.io/
☆ DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a ``supervision deficit'': the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose \textbf{DriveVLA-W0}, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
☆ CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations ICCV 2025
We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistence, where the bifiltration functions are jointly learned. Thanks to the differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging and computer vision datasets show the benefit CuMPerLay on classification and segmentation performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.
comment: Appears at ICCV 2025
☆ ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.
☆ UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last layer information from VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting its accessibility.We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism that extracts both high level semantics and low level details from text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and faithful transfer of visual information from VLM to the diffusion model which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexibile Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM's reasoning capabilities for increased capabilities and flexibility at inference. In addition, finetuning on editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model when trained on single image editing, zero-shot generalizes to multiple image references further motivating the unified encoder design of UniFusion.
comment: Project page at https://thekevinli.github.io/unifusion/
Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report ICCV 2025
This paper reviews the AIM 2025 Efficient Real-World Deblurring using Single Images Challenge, which aims to advance in efficient real-blur restoration. The challenge is based on a new test set based on the well known RSBlur dataset. Pairs of blur and degraded images in this dataset are captured using a double-camera system. Participant were tasked with developing solutions to effectively deblur these type of images while fulfilling strict efficiency constraints: fewer than 5 million model parameters and a computational budget under 200 GMACs. A total of 71 participants registered, with 4 teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 31.1298 dB, showcasing the potential of efficient methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers in efficient real-world image deblurring.
comment: ICCV 2025 - AIM Workshop
☆ MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars
Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.
comment: 18 pages, 12 figures
☆ SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal ``evaluator'', providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a \textbf{global reward} ensures the correctness of the overall visual semantics and layout, while a \textbf{local reward} refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to \textbf{88.37} and on T2I-ReasonBench from 43.82 to \textbf{46.75}. Overall, our work establishes a powerful new paradigm for enabling a UMMs' understanding module to guide and enhance its own generation via self-rewarding.
comment: 20 pages, 8 figures, webpage can be seen in https://waynejin0918.github.io/srum_web/
☆ What If : Understanding Motion Through Sparse Interactions
Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed "pokes". Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable directly accessible representation of multi-modal scene motion, its dependency on physical interactions and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned in strongly out-of-distribution tasks such as synthetic datasets to enable significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes which further demonstrates the versatility of our FPT. Code and models are publicly available at https://compvis.github.io/flow-poke-transformer.
comment: Project page and code: https://compvis.github.io/flow-poke-transformer
☆ Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction
Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our key insight is to estimate time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.
comment: Project page: https://tamu-visual-ai.github.io/usplat4d/
Efficient Perceptual Image Super Resolution: AIM 2025 Study and Benchmark ICCV 2025
This paper presents a comprehensive study and benchmark on Efficient Perceptual Super-Resolution (EPSR). While significant progress has been made in efficient PSNR-oriented super resolution, approaches focusing on perceptual quality metrics remain relatively inefficient. Motivated by this gap, we aim to replicate or improve the perceptual results of Real-ESRGAN while meeting strict efficiency constraints: a maximum of 5M parameters and 2000 GFLOPs, calculated for an input size of 960x540 pixels. The proposed solutions were evaluated on a novel dataset consisting of 500 test images of 4K resolution, each degraded using multiple degradation types, without providing the original high-quality counterparts. This design aims to reflect realistic deployment conditions and serves as a diverse and challenging benchmark. The top-performing approach manages to outperform Real-ESRGAN across all benchmark datasets, demonstrating the potential of efficient methods in the perceptual domain. This paper establishes the modern baselines for efficient perceptual super resolution.
comment: ICCV 2025 - AIM Workshop
☆ AnyUp: Universal Feature Upsampling
We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.
comment: Project Website: https://wimmerth.github.io/anyup/
☆ PET Head Motion Estimation Using Supervised Deep Learning with Attention
Head movement poses a significant challenge in brain positron emission tomography (PET) imaging, resulting in image artifacts and tracer uptake quantification inaccuracies. Effective head motion estimation and correction are crucial for precise quantitative image analysis and accurate diagnosis of neurological disorders. Hardware-based motion tracking (HMT) has limited applicability in real-world clinical practice. To overcome this limitation, we propose a deep-learning head motion correction approach with cross-attention (DL-HMC++) to predict rigid head motion from one-second 3D PET raw data. DL-HMC++ is trained in a supervised manner by leveraging existing dynamic PET scans with gold-standard motion measurements from external HMT. We evaluate DL-HMC++ on two PET scanners (HRRT and mCT) and four radiotracers (18F-FDG, 18F-FPEB, 11C-UCB-J, and 11C-LSN3172176) to demonstrate the effectiveness and generalization of the approach in large cohort PET studies. Quantitative and qualitative results demonstrate that DL-HMC++ consistently outperforms state-of-the-art data-driven motion estimation methods, producing motion-free images with clear delineation of brain structures and reduced motion artifacts that are indistinguishable from gold-standard HMT. Brain region of interest standard uptake value analysis exhibits average difference ratios between DL-HMC++ and gold-standard HMT to be 1.2 plus-minus 0.5% for HRRT and 0.5 plus-minus 0.2% for mCT. DL-HMC++ demonstrates the potential for data-driven PET head motion correction to remove the burden of HMT, making motion correction accessible to clinical populations beyond research settings. The code is available at https://github.com/maxxxxxxcai/DL-HMC-TMI.
comment: Accepted for publication in IEEE Transactions on Medical Imaging (TMI), 2025. This is the accepted manuscript version
☆ E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization NeurIPS 2025
The estimation of optical flow and 6-DoF ego-motion, two fundamental tasks in 3D vision, has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter, which parametrizes the optical flow in terms of the scene depth and the camera motion, often converges to suboptimal local minima. To address these issues, we propose an unsupervised framework that jointly optimizes egomotion and optical flow via implicit spatial-temporal and geometric regularization. First, by modeling camera's egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility to general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive even with supervised approaches.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems(NeurIPS 2025)
☆ VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
☆ SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding
The scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, while existing solutions are still prone to segmentation deficiency, dynamic objects' interference, sensor data sparsity, and view-limitation problems. This paper proposes a novel framework, named SPORTS, for holistic scene understanding via tightly integrating Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) tasks into an iterative and unified perspective. Firstly, VPS designs an adaptive attention-based geometric fusion mechanism to align cross-frame features via enrolling the pose, depth, and optical flow modality, which automatically adjust feature maps for different decoding stages. And a post-matching strategy is integrated to improve identities tracking. In VO, panoptic segmentation results from VPS are combined with the optical flow map to improve the confidence estimation of dynamic objects, which enhances the accuracy of the camera pose estimation and completeness of the depth map generation via the learning-based paradigm. Furthermore, the point-based rendering of SR is beneficial from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on the odometry, tracking, segmentation, and novel view synthesis tasks.
comment: Accepted by IEEE Transactions on Multimedia
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.
comment: Project page with code: https://zhuang2002.github.io/FlashVSR
☆ Personalized Federated Fine-Tuning of Vision Foundation Models for Healthcare
Foundation models open up new possibilities for the use of AI in healthcare. However, even when pre-trained on health data, they still need to be fine-tuned for specific downstream tasks. Furthermore, although foundation models reduce the amount of training data required to achieve good performance, obtaining sufficient data is still a challenge. This is due, in part, to restrictions on sharing and aggregating data from different sources to protect patients' privacy. One possible solution to this is to fine-tune foundation models via federated learning across multiple participating clients (i.e., hospitals, clinics, etc.). In this work, we propose a new personalized federated fine-tuning method that learns orthogonal LoRA adapters to disentangle general and client-specific knowledge, enabling each client to fully exploit both their own data and the data of others. Our preliminary results on real-world federated medical imaging tasks demonstrate that our approach is competitive against current federated fine-tuning methods.
comment: Accepted to the Symposium on Model Accountability, Sustainability and Healthcare (SMASH) 2025
☆ Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains limited explored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.
comment: https://github.com/ddlBoJack/Omni-Captioner
☆ Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce IRIS, an Interactive Reasoning with Images and Systems that evaluates MLLMs' ability to perceive, transform, and reason across complex visual-textual tasks under the think with images paradigm. IRIS comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, IRIS offers critical insights for advancing visual intelligence in MLLMs.
☆ SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.158% and the 14-day LT gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.
comment: Technical Report
☆ Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis MICCAI 2025
Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.
comment: Accepted by iMIMIC at MICCAI 2025
☆ DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization
Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when only corrupted or noisy observations are available remains challenging. In this work, we propose a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data. Our proposed method, DiffEM, utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.
☆ EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels
Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at https://github.com/KPeng9510/ERELIFM.
comment: The source code is available at https://github.com/KPeng9510/ERELIFM
☆ MCOP: Multi-UAV Collaborative Occupancy Prediction
Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird's Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experiments results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.
☆ TerraCodec: Compressing Earth Observations
Earth observation (EO) satellites produce massive streams of multispectral image time series, posing pressing challenges for storage and transmission. Yet, learned EO compression remains fragmented, lacking publicly available pretrained models and misaligned with advances in compression for natural imagery. Image codecs overlook temporal redundancy, while video codecs rely on motion priors that fail to capture the radiometric evolution of largely static scenes. We introduce TerraCodec (TEC), a family of learned codecs tailored to EO. TEC includes efficient image-based variants adapted to multispectral inputs, as well as a Temporal Transformer model (TEC-TT) that leverages dependencies across time. To overcome the fixed-rate setting of today's neural codecs, we present Latent Repacking, a novel method for training flexible-rate transformer models that operate on varying rate-distortion settings. Trained on Sentinel-2 data, TerraCodec outperforms classical codecs, achieving 3-10x stronger compression at equivalent image quality. Beyond compression, TEC-TT enables zero-shot cloud inpainting, surpassing state-of-the-art methods on the AllClear benchmark. Our results establish bespoke, learned compression algorithms as a promising direction for Earth observation. Code and model weights will be released under a permissive license.
☆ On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation ICCV
In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives.
comment: Accepted at ICCVW 2025
☆ Zero-Shot CFC: Fast Real-World Image Denoising based on Cross-Frequency Consistency
Zero-shot denoisers address the dataset dependency of deep-learning-based denoisers, enabling the denoising of unseen single images. Nonetheless, existing zero-shot methods suffer from long training times and rely on the assumption of noise independence and a zero-mean property, limiting their effectiveness in real-world denoising scenarios where noise characteristics are more complicated. This paper proposes an efficient and effective method for real-world denoising, the Zero-Shot denoiser based on Cross-Frequency Consistency (ZSCFC), which enables training and denoising with a single noisy image and does not rely on assumptions about noise distribution. Specifically, image textures exhibit position similarity and content consistency across different frequency bands, while noise does not. Based on this property, we developed cross-frequency consistency loss and an ultralight network to realize image denoising. Experiments on various real-world image datasets demonstrate that our ZSCFC outperforms other state-of-the-art zero-shot methods in terms of computational efficiency and denoising performance.
comment: The British Machine Vision Conference
☆ WaterFlow: Explicit Physics-Prior Rectified Flow for Underwater Saliency Mask Generation
Underwater Salient Object Detection (USOD) faces significant challenges, including underwater image quality degradation and domain gaps. Existing methods tend to ignore the physical principles of underwater imaging or simply treat degradation phenomena in underwater images as interference factors that must be eliminated, failing to fully exploit the valuable information they contain. We propose WaterFlow, a rectified flow-based framework for underwater salient object detection that innovatively incorporates underwater physical imaging information as explicit priors directly into the network training process and introduces temporal dimension modeling, significantly enhancing the model's capability for salient object identification. On the USOD10K dataset, WaterFlow achieves a 0.072 gain in S_m, demonstrating the effectiveness and superiority of our method. The code will be published after the acceptance.
☆ Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilicate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at https://github.com/FYYDCC/IVT-LR.
☆ Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 number of function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.
☆ LayerSync: Self-aligning Intermediate Layers
We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularizer term with no overhead on diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires no pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of flow-based transformer by over 8.75x on ImageNet dataset and improved the generation quality by 23.6%. The code is available at https://github.com/vita-epfl/LayerSync.
☆ Unlocking Zero-Shot Plant Segmentation with Pl@ntNet Intelligence
We present a zero-shot segmentation approach for agricultural imagery that leverages Plantnet, a large-scale plant classification model, in conjunction with its DinoV2 backbone and the Segment Anything Model (SAM). Rather than collecting and annotating new datasets, our method exploits Plantnet's specialized plant representations to identify plant regions and produce coarse segmentation masks. These masks are then refined by SAM to yield detailed segmentations. We evaluate on four publicly available datasets of various complexity in terms of contrast including some where the limited size of the training data and complex field conditions often hinder purely supervised methods. Our results show consistent performance gains when using Plantnet-fine-tuned DinoV2 over the base DinoV2 model, as measured by the Jaccard Index (IoU). These findings highlight the potential of combining foundation models with specialized plant-centric models to alleviate the annotation bottleneck and enable effective segmentation in diverse agricultural scenarios.
☆ Learning Human Motion with Temporally Conditional Mamba
Learning human motion based on a time-dependent input signal presents a challenging yet impactful task with various applications. The goal of this task is to generate or estimate human movement that consistently reflects the temporal patterns of conditioning inputs. Existing methods typically rely on cross-attention mechanisms to fuse the condition with motion. However, this approach primarily captures global interactions and struggles to maintain step-by-step temporal alignment. To address this limitation, we introduce Temporally Conditional Mamba, a new mamba-based model for human motion generation. Our approach integrates conditional information into the recurrent dynamics of the Mamba block, enabling better temporally aligned motion. To validate the effectiveness of our method, we evaluate it on a variety of human motion tasks. Extensive experiments demonstrate that our model significantly improves temporal alignment, motion realism, and condition consistency over state-of-the-art approaches. Our project page is available at https://zquang2202.github.io/TCM.
comment: 10 pages
☆ MMOT: The First Challenging Benchmark for Drone-based Multispectral Multi-Object Tracking
Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial cues that enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce MMOT, the first challenging benchmark for drone-based multispectral multi-object tracking. It features three key characteristics: (i) Large Scale - 125 video sequences with over 488.8K annotations across eight categories; (ii) Comprehensive Challenges - covering diverse conditions such as extreme small targets, high-density scenarios, severe occlusions, and complex motion; and (iii) Precise Oriented Annotations - enabling accurate localization and reduced ambiguity under aerial perspectives. To better extract spectral features and leverage oriented annotations, we further present a multispectral and orientation-aware MOT scheme adapting existing methods, featuring: (i) a lightweight Spectral 3D-Stem integrating spectral features while preserving compatibility with RGB pretraining; (ii) an orientation-aware Kalman filter for precise state estimation; and (iii) an end-to-end orientation-adaptive transformer. Extensive experiments across representative trackers consistently show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects. We believe our work will advance drone-based multispectral multi-object tracking research. Our MMOT, code, and benchmarks are publicly available at https://github.com/Annzstbl/MMOT.
☆ CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving
End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training. CoIRL-AD introduces a competition-based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.
comment: 18 pages, 17 figures
☆ VISaGE: Understanding Visual Generics and Exceptions
While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.
comment: EMNLP 2025
☆ Unconditional Human Motion and Shape Generation via Balanced Score-Based Diffusion
Recent work has explored a range of model families for human motion generation, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion-based models. Despite their differences, many methods rely on over-parameterized input features and auxiliary losses to improve empirical results. These strategies should not be strictly necessary for diffusion models to match the human motion distribution. We show that on par with state-of-the-art results in unconditional human motion generation are achievable with a score-based diffusion model using only careful feature-space normalization and analytically derived weightings for the standard L2 score-matching loss, while generating both motion and shape directly, thereby avoiding slow post hoc shape recovery from joints. We build the method step by step, with a clear theoretical motivation for each component, and provide targeted ablations demonstrating the effectiveness of each proposed addition in isolation.
☆ Voronoi-Assisted Diffusion for Computing Unsigned Distance Fields from Unoriented Points
Unsigned Distance Fields (UDFs) provide a flexible representation for 3D shapes with arbitrary topology, including open and closed surfaces, orientable and non-orientable geometries, and non-manifold structures. While recent neural approaches have shown promise in learning UDFs, they often suffer from numerical instability, high computational cost, and limited controllability. We present a lightweight, network-free method, Voronoi-Assisted Diffusion (VAD), for computing UDFs directly from unoriented point clouds. Our approach begins by assigning bi-directional normals to input points, guided by two Voronoi-based geometric criteria encoded in an energy function for optimal alignment. The aligned normals are then diffused to form an approximate UDF gradient field, which is subsequently integrated to recover the final UDF. Experiments demonstrate that VAD robustly handles watertight and open surfaces, as well as complex non-manifold and non-orientable geometries, while remaining computationally efficient and stable.
☆ BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring
3D Gaussian Splatting has exhibited remarkable capabilities in 3D scene reconstruction.However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge.The performance of existing 3DGS-based deblurring methods are limited due to their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and inability to effectively control erroneous Gaussian primitives densification caused by motion blur.To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting, to accurately reconstruct 3D scenes from motion-blurred images.BSGS contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with fixed rough camera poses, Global RigidTransformation further corrects motion-induced blur distortions.To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages.Furthermore, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the arts.
Fast Visuomotor Policy for Robotic Manipulation
We present a fast and effective policy framework for robotic manipulation, named Energy Policy, designed for high-frequency robotic tasks and resource-constrained systems. Unlike existing robotic policies, Energy Policy natively predicts multimodal actions in a single forward pass, enabling high-precision manipulation at high speed. The framework is built upon two core components. First, we adopt the energy score as the learning objective to facilitate multimodal action modeling. Second, we introduce an energy MLP to implement the proposed objective while keeping the architecture simple and efficient. We conduct comprehensive experiments in both simulated environments and real-world robotic tasks to evaluate the effectiveness of Energy Policy. The results show that Energy Policy matches or surpasses the performance of state-of-the-art manipulation methods while significantly reducing computational overhead. Notably, on the MimicGen benchmark, Energy Policy achieves superior performance with at a faster inference compared to existing approaches.
☆ A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation
Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
☆ MS-GAGA: Metric-Selective Guided Adversarial Generation Attack
We present MS-GAGA (Metric-Selective Guided Adversarial Generation Attack), a two-stage framework for crafting transferable and visually imperceptible adversarial examples against deepfake detectors in black-box settings. In Stage 1, a dual-stream attack module generates adversarial candidates: MNTD-PGD applies enhanced gradient calculations optimized for small perturbation budgets, while SG-PGD focuses perturbations on visually salient regions. This complementary design expands the adversarial search space and improves transferability across unseen models. In Stage 2, a metric-aware selection module evaluates candidates based on both their success against black-box models and their structural similarity (SSIM) to the original image. By jointly optimizing transferability and imperceptibility, MS-GAGA achieves up to 27% higher misclassification rates on unseen detectors compared to state-of-the-art attacks.
☆ A Function Centric Perspective On Flat and Sharp Minima
Flat minima are widely believed to correlate with improved generalisation in deep neural networks. However, this connection has proven more nuanced in recent studies, with both theoretical counterexamples and empirical exceptions emerging in the literature. In this paper, we revisit the role of sharpness in model performance, proposing that sharpness is better understood as a function-dependent property rather than a reliable indicator of poor generalisation. We conduct extensive empirical studies, from single-objective optimisation to modern image classification tasks, showing that sharper minima often emerge when models are regularised (e.g., via SAM, weight decay, or data augmentation), and that these sharp minima can coincide with better generalisation, calibration, robustness, and functional consistency. Across a range of models and datasets, we find that baselines without regularisation tend to converge to flatter minima yet often perform worse across all safety metrics. Our findings demonstrate that function complexity, rather than flatness alone, governs the geometry of solutions, and that sharper minima can reflect more appropriate inductive biases (especially under regularisation), calling for a function-centric reappraisal of loss landscape geometry.
comment: 26 pages, 26 tables, 63 figures, pre-print
☆ A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation
Chest Xray imaging is a widely used diagnostic tool in modern medicine, and its high utilization creates substantial workloads for radiologists. To alleviate this burden, vision language models are increasingly applied to automate Chest Xray radiology report generation (CXRRRG), aiming for clinically accurate descriptions while reducing manual effort. Conventional approaches, however, typically rely on single images, failing to capture the longitudinal context necessary for producing clinically faithful comparison statements. Recently, growing attention has been directed toward incorporating longitudinal data into CXR RRG, enabling models to leverage historical studies in ways that mirror radiologists diagnostic workflows. Nevertheless, existing surveys primarily address single image CXRRRG and offer limited guidance for longitudinal settings, leaving researchers without a systematic framework for model design. To address this gap, this survey provides the first comprehensive review of longitudinal radiology report generation (LRRG). Specifically, we examine dataset construction strategies, report generation architectures alongside longitudinally tailored designs, and evaluation protocols encompassing both longitudinal specific measures and widely used benchmarks. We further summarize LRRG methods performance, alongside analyses of different ablation studies, which collectively highlight the critical role of longitudinal information and architectural design choices in improving model performance. Finally, we summarize five major limitations of current research and outline promising directions for future development, aiming to lay a foundation for advancing this emerging field.
☆ Tensor Completion via Monotone Inclusion: Generalized Low-Rank Priors Meet Deep Denoisers
Missing entries in multi dimensional data pose significant challenges for downstream analysis across diverse real world applications. These data are naturally modeled as tensors, and recent completion methods integrating global low rank priors with plug and play denoisers have demonstrated strong empirical performance. However, these approaches often rely on empirical convergence alone or unrealistic assumptions, such as deep denoisers acting as proximal operators of implicit regularizers, which generally does not hold. To address these limitations, we propose a novel tensor completion framework grounded in the monotone inclusion paradigm, which unifies generalized low rank priors with deep pseudo contractive denoisers and extends beyond traditional convex optimization. Building on the Davis Yin splitting scheme, we develop the GTCTV DPC algorithm and rigorously establish its global convergence. Extensive experiments demonstrate that GTCTV DPC consistently outperforms existing methods in both quantitative metrics and visual quality, particularly at low sampling rates.
comment: 22 pages, 5 figures
☆ VideoLucy: Deep Memory Backtracking for Long Video Understanding NeurIPS-2025
Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly at https://videolucy.github.io
comment: NeurIPS-2025 Accepted Paper
☆ Low-Field Magnetic Resonance Image Quality Enhancement using a Conditional Flow Matching Model
This paper introduces a novel framework for image quality transfer based on conditional flow matching (CFM). Unlike conventional generative models that rely on iterative sampling or adversarial objectives, CFM learns a continuous flow between a noise distribution and target data distributions through the direct regression of an optimal velocity field. We evaluate this approach in the context of low-field magnetic resonance imaging (LF-MRI), a rapidly emerging modality that offers affordable and portable scanning but suffers from inherently low signal-to-noise ratio and reduced diagnostic quality. Our framework is designed to reconstruct high-field-like MR images from their corresponding low-field inputs, thereby bridging the quality gap without requiring expensive infrastructure. Experiments demonstrate that CFM not only achieves state-of-the-art performance, but also generalizes robustly to both in-distribution and out-of-distribution data. Importantly, it does so while utilizing significantly fewer parameters than competing deep learning methods. These results underline the potential of CFM as a powerful and scalable tool for MRI reconstruction, particularly in resource-limited clinical environments.
☆ Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda
Urban monitoring of public infrastructure (such as waste bins, road signs, vegetation, sidewalks, and construction sites) poses significant challenges due to the diversity of objects, environments, and contextual conditions involved. Current state-of-the-art approaches typically rely on a combination of IoT sensors and manual inspections, which are costly, difficult to scale, and often misaligned with citizens' perception formed through direct visual observation. This raises a critical question: Can machines now "see" like citizens and infer informed opinions about the condition of urban infrastructure? Vision-Language Models (VLMs), which integrate visual understanding with natural language reasoning, have recently demonstrated impressive capabilities in processing complex visual information, turning them into a promising technology to address this challenge. This systematic review investigates the role of VLMs in urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021 and 2025 to address four core research questions: (1) What urban monitoring tasks have been effectively addressed using VLMs? (2) Which VLM architectures and frameworks are most commonly used and demonstrate superior performance? (3) What datasets and resources support this emerging field? (4) How are VLM-based applications evaluated, and what performance levels have been reported?
comment: 44 pages
☆ Scene Coordinate Reconstruction Priors ICCV 2025
Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets our priors help learning better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on down-stream tasks such as novel view synthesis and camera relocalization.
comment: ICCV 2025, Project page: https://nianticspatial.github.io/scr-priors/
☆ Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling
Procedure step recognition (PSR) aims to identify all correctly completed steps and their sequential order in videos of procedural tasks. The existing state-of-the-art models rely solely on detecting assembly object states in individual video frames. By neglecting temporal features, model robustness and accuracy are limited, especially when objects are partially occluded. To overcome these limitations, we propose Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition (STORM-PSR), a dual-stream framework for PSR that leverages both spatial and temporal features. The assembly state detection stream operates effectively with unobstructed views of the object, while the spatio-temporal stream captures both spatial and temporal features to recognize step completions even under partial occlusion. This stream includes a spatial encoder, pre-trained using a novel weakly supervised approach to capture meaningful spatial representations, and a transformer-based temporal encoder that learns how these spatial features relate over time. STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared to prior methods. We demonstrate that this reduction in delay is driven by the spatio-temporal stream, which does not rely on unobstructed views of the object to infer completed steps. The code for STORM-PSR, along with the newly annotated MECCANO labels, is made publicly available at https://timschoonbeek.github.io/stormpsr .
comment: 26 pages, 7 figures and 5 tables in the main paper and one figure and table in the appendix. To be published in Computer Vision and Image Understanding
☆ Deep Attention-guided Adaptive Subsampling
Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a nondifferentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works, have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not inputadaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.
☆ CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion
Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.
☆ Hybrid Gaussian Splatting for Novel Urban View Synthesis ICCV 2025
This paper describes the Qualcomm AI Research solution to the RealADSim-NVS challenge, hosted at the RealADSim Workshop at ICCV 2025. The challenge concerns novel view synthesis in street scenes, and participants are required to generate, starting from car-centric frames captured during some training traversals, renders of the same urban environment as viewed from a different traversal (e.g. different street lane or car direction). Our solution is inspired by hybrid methods in scene generation and generative simulators merging gaussian splatting and diffusion models, and it is composed of two stages: First, we fit a 3D reconstruction of the scene and render novel views as seen from the target cameras. Then, we enhance the resulting frames with a dedicated single-step diffusion model. We discuss specific choices made in the initialization of gaussian primitives as well as the finetuning of the enhancer model and its training data curation. We report the performance of our model design and we ablate its components in terms of novel view quality as measured by PSNR, SSIM and LPIPS. On the public leaderboard reporting test results, our proposal reaches an aggregated score of 0.432, achieving the second place overall.
comment: ICCV 2025 RealADSim Workshop
☆ Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector
Vision Language Models (VLMs) have achieved impressive progress in multimodal reasoning; yet, they remain vulnerable to hallucinations, where outputs are not grounded in visual evidence. In this paper, we investigate a previously overlooked setting: logo hallucination, where models generate brand names or textual content despite logos containing no visible words. Using curated splits of pure symbols, hybrids, and text-bearing logos, as well as the challenging Hard-60 subset, we systematically measure hallucination across leading VLMs. We further probe robustness through nine structured perturbations and show that hallucinations persist even under strong distortions, with occlusion exposing the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA demonstrates that hallucination is tied to a small subset of projector dimensions, and targeted ablation substantially reduces errors while preserving OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic priors rather than genuine glyph perception, particularly for iconic circular logos, and that projector subspaces play a decisive role in this failure mode. Our work contributes both a novel diagnostic lens and actionable mitigation insights, highlighting projector disentanglement and OCR-guided decoding as promising directions for building more trustworthy multimodal systems.
☆ Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval
Almost all previous text-to-video retrieval works ideally assume that videos are pre-trimmed with short durations containing solely text-related content. However, in practice, videos are typically untrimmed in long durations with much more complicated background content. Therefore, in this paper, we focus on the more practical yet challenging task of Partially Relevant Video Retrieval (PRVR), which aims to retrieve partially relevant untrimmed videos with the given query. To tackle this task, we propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model and transfers it to a lightweight, task-specific PRVR network. Specifically, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), where a large teacher model provides supervision to a compact dual-branch student network. The student model comprises two branches: an inheritance branch that absorbs transferable knowledge from the teacher, and an exploration branch that learns task-specific information from the PRVR dataset to address domain gaps. To further enhance learning, we incorporate a dynamic soft-target construction mechanism. By replacing rigid hard-target supervision with adaptive soft targets that evolve during training, our method enables the model to better capture the fine-grained, partial relevance between videos and queries. Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets for PRVR. The code is available at https://github.com/HuiGuanLab/DL-DKD.
☆ PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes
Reconstructing dynamic 3D urban scenes is crucial for autonomous driving, yet current methods face a stark trade-off between fidelity and computational cost. This inefficiency stems from their semantically agnostic design, which allocates resources uniformly, treating static backgrounds and safety-critical objects with equal importance. To address this, we introduce Priority-Adaptive Gaussian Splatting (PAGS), a framework that injects task-aware semantic priorities directly into the 3D reconstruction and rendering pipeline. PAGS introduces two core contributions: (1) Semantically-Guided Pruning and Regularization strategy, which employs a hybrid importance metric to aggressively simplify non-critical scene elements while preserving fine-grained details on objects vital for navigation. (2) Priority-Driven Rendering pipeline, which employs a priority-based depth pre-pass to aggressively cull occluded primitives and accelerate the final shading computations. Extensive experiments on the Waymo and KITTI datasets demonstrate that PAGS achieves exceptional reconstruction quality, particularly on safety-critical objects, while significantly reducing training time and boosting rendering speeds to over 350 FPS.
☆ SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis
With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating real-world challenging scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models exhibit poor performance in spinal tasks, highlighting limitations of current MLLM in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.
comment: Proceedings of the 33rd ACM International Conference on Multimedia,ACMMM 2025 Dataset Track
☆ AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion
Visible-infrared image fusion is crucial in key applications such as autonomous driving and nighttime surveillance. Its main goal is to integrate multimodal information to produce enhanced images that are better suited for downstream tasks. Although deep learning based fusion methods have made significant progress, mainstream unsupervised approaches still face serious challenges in practical applications. Existing methods mostly rely on manually designed loss functions to guide the fusion process. However, these loss functions have obvious limitations. On one hand, the reference images constructed by existing methods often lack details and have uneven brightness. On the other hand, the widely used gradient losses focus only on gradient magnitude. To address these challenges, this paper proposes an angle-based perception framework for spatial-sensitive image fusion (AngularFuse). At first, we design a cross-modal complementary mask module to force the network to learn complementary information between modalities. Then, a fine-grained reference image synthesis strategy is introduced. By combining Laplacian edge enhancement with adaptive histogram equalization, reference images with richer details and more balanced brightness are generated. Last but not least, we introduce an angle-aware loss, which for the first time constrains both gradient magnitude and direction simultaneously in the gradient domain. AngularFuse ensures that the fused images preserve both texture intensity and correct edge orientation. Comprehensive experiments on the MSRS, RoadScene, and M3FD public datasets show that AngularFuse outperforms existing mainstream methods with clear margin. Visual comparisons further confirm that our method produces sharper and more detailed results in challenging scenes, demonstrating superior fusion capability.
comment: For the first time, angle-based perception was introduced into the multi-modality image fusion task
Local Background Features Matter in Out-of-Distribution Detection
Out-of-distribution (OOD) detection is crucial when deploying deep neural networks in the real world to ensure the reliability and safety of their applications. One main challenge in OOD detection is that neural network models often produce overconfident predictions on OOD data. While some methods using auxiliary OOD datasets or generating fake OOD images have shown promising OOD detection performance, they are limited by the high costs of data collection and training. In this study, we propose a novel and effective OOD detection method that utilizes local background features as fake OOD features for model training. Inspired by the observation that OOD images generally share similar background regions with ID images, the background features are extracted from ID images as simulated OOD visual representations during training based on the local invariance of convolution. Through being optimized to reduce the $L_2$-norm of these background features, the neural networks are able to alleviate the overconfidence issue on OOD data. Extensive experiments on multiple standard OOD detection benchmarks confirm the effectiveness of our method and its wide combinatorial compatibility with existing post-hoc methods, with new state-of-the-art performance achieved from our method.
☆ Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images ICCV2025
We propose two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence-Adaptive Multiplicative Loss applies a confidence-driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter-free mechanism for robust segmentation under challenging data limitations.
comment: Accepted by ICCV2025 Workshop "Third Workshop on Computer Vision for Automated Medical Diagnosis"
☆ Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding
Current video representations heavily rely on unstable and over-grained priors for motion and appearance modelling, \emph{i.e.}, pixel-level matching and tracking. A tracking error of just a few pixels would lead to the collapse of the visual object representation, not to mention occlusions and large motion frequently occurring in videos. To overcome the above mentioned vulnerability, this work proposes spatio-temporally consistent proxy nodes to represent dynamically changing objects/scenes in the video. On the one hand, the hierarchical proxy nodes have the ability to stably express the multi-scale structure of visual objects, so they are not affected by accumulated tracking error, long-term motion, occlusion, and viewpoint variation. On the other hand, the dynamic representation update mechanism of the proxy nodes adequately leverages spatio-temporal priors of the video to mitigate the impact of inaccurate trackers, thereby effectively handling drastic changes in scenes and objects. Additionally, the decoupled encoding manner of the shape and texture representations across different visual objects in the video facilitates controllable and fine-grained appearance editing capability. Extensive experiments demonstrate that the proposed representation achieves high video reconstruction accuracy with fewer parameters and supports complex video processing tasks, including video in-painting and keyframe-based temporally consistent video editing.
☆ Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection
In the multimedia domain, Infrared Small Target Detection (ISTD) plays a important role in drone-based multi-modality sensing. To address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD, we propose a doubly wavelet-guided Invariance learning framework(Ivan-ISTD). In the first stage, we generate training samples aligned with the target domain using Wavelet-guided Cross-domain Synthesis. This wavelet-guided alignment machine accurately separates the target background through multi-frequency wavelet filtering. In the second stage, we introduce Real-domain Noise Invariance Learning, which extracts real noise characteristics from the target domain to build a dynamic noise library. The model learns noise invariance through self-supervised loss, thereby overcoming the limitations of distribution bias in traditional artificial noise modeling. Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic degradation dataset that simulates the distribution shifts encountered in real-world applications. Additionally, we validate the versatility of our method using other real-world datasets. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods in terms of many quantitative metrics. In particular, Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios. The code for this work can be found at: https://github.com/nanjin1/Ivan-ISTD.
comment: In infrared small target detection, noise from different sensors can cause significant interference to performance. We propose a new dataset and a wavelet-guided Invariance learning framework(Ivan-ISTD) to emphasize this issue
☆ BIGFix: Bidirectional Image Generation with Token Fixing
Recent advances in image and video generation have raised significant interest from both academia and industry. A key challenge in this field is improving inference efficiency, as model size and the number of inference steps directly impact the commercial viability of generative models while also posing fundamental scientific challenges. A promising direction involves combining auto-regressive sequential token modeling with multi-token prediction per step, reducing inference time by up to an order of magnitude. However, predicting multiple tokens in parallel can introduce structural inconsistencies due to token incompatibilities, as capturing complex joint dependencies during training remains challenging. Traditionally, once tokens are sampled, there is no mechanism to backtrack and refine erroneous predictions. We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. Our method preserves the efficiency benefits of parallel token prediction while significantly enhancing generation quality. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.
☆ HoneyBee: Data Recipes for Vision-Language Reasoners
Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.
comment: 32 pages
☆ DIANet: A Phase-Aware Dual-Stream Network for Micro-Expression Recognition via Dynamic Images
Micro-expressions are brief, involuntary facial movements that typically last less than half a second and often reveal genuine emotions. Accurately recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. However, micro-expression recognition (MER) remains a challenging task due to the subtle and transient nature of facial cues and the limited availability of annotated data. While dynamic image (DI) representations have been introduced to summarize temporal motion into a single frame, conventional DI-based methods often overlook the distinct characteristics of different temporal phases within a micro-expression. To address this issue, this paper proposes a novel dual-stream framework, DIANet, which leverages phase-aware dynamic images - one encoding the onset-to-apex phase and the other capturing the apex-to-offset phase. Each stream is processed by a dedicated convolutional neural network, and a cross-attention fusion module is employed to adaptively integrate features from both streams based on their contextual relevance. Extensive experiments conducted on three benchmark MER datasets (CASME-II, SAMM, and MMEW) demonstrate that the proposed method consistently outperforms conventional single-phase DI-based approaches. The results highlight the importance of modeling temporal phase information explicitly and suggest a promising direction for advancing MER.
☆ The Impact of Synthetic Data on Object Detection Model Performance: A Comparative Analysis with Real-World Data
Recent advances in generative AI, particularly in computer vision (CV), offer new opportunities to optimize workflows across industries, including logistics and manufacturing. However, many AI applications are limited by a lack of expertise and resources, which forces a reliance on general-purpose models. Success with these models often requires domain-specific data for fine-tuning, which can be costly and inefficient. Thus, using synthetic data for fine-tuning is a popular, cost-effective alternative to gathering real-world data. This work investigates the impact of synthetic data on the performance of object detection models, compared to models trained on real-world data only, specifically within the domain of warehouse logistics. To this end, we examined the impact of synthetic data generated using the NVIDIA Omniverse Replicator tool on the effectiveness of object detection models in real-world scenarios. It comprises experiments focused on pallet detection in a warehouse setting, utilizing both real and various synthetic dataset generation strategies. Our findings provide valuable insights into the practical applications of synthetic image data in computer vision, suggesting that a balanced integration of synthetic and real data can lead to robust and efficient object detection models.
comment: 18 pages, 12 figures, 2 tables. Code: https://github.com/MuammerBay/omniverse-replicator-sim2real-analysis ; Data: https://doi.org/10.5281/zenodo.17308406
☆ Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos ICCV 2025
Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.
comment: 2nd Place Winner, ICCV 2025 2COOOL Competition
☆ CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a solution to their high computational complexity, making them more practical for real-world applications. In this regard, the knowledge distillation (KD) approach has emerged as a promising alternative, which transfers the rich visual and linguistic knowledge from a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities while maintaining strong performance on visual question answering tasks, as done in existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.
comment: Preprint. Under Review
☆ BEEP3D: Box-Supervised End-to-End Pseudo-Mask Generation for 3D Instance Segmentation
3D instance segmentation is crucial for understanding complex 3D environments, yet fully supervised methods require dense point-level annotations, resulting in substantial annotation costs and labor overhead. To mitigate this, box-level annotations have been explored as a weaker but more scalable form of supervision. However, box annotations inherently introduce ambiguity in overlapping regions, making accurate point-to-instance assignment challenging. Recent methods address this ambiguity by generating pseudo-masks through training a dedicated pseudo-labeler in an additional training stage. However, such two-stage pipelines often increase overall training time and complexity, hinder end-to-end optimization. To overcome these challenges, we propose BEEP3D-Box-supervised End-to-End Pseudo-mask generation for 3D instance segmentation. BEEP3D adopts a student-teacher framework, where the teacher model serves as a pseudo-labeler and is updated by the student model via an Exponential Moving Average. To better guide the teacher model to generate precise pseudo-masks, we introduce an instance center-based query refinement that enhances position query localization and leverages features near instance centers. Additionally, we design two novel losses-query consistency loss and masked feature consistency loss-to align semantic and geometric signals between predictions and pseudo-masks. Extensive experiments on ScanNetV2 and S3DIS datasets demonstrate that BEEP3D achieves competitive or superior performance compared to state-of-the-art weakly supervised methods while remaining computationally efficient.
☆ UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering
In this paper, we propose UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. Our framework integrates a CUDA-accelerated rasterization pipeline capable of rendering photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits simultaneously. We redesign the rasterization to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of rotation and scale attribute through analytic depth gradients. Furthermore, we derive the analytic gradient formulation for surface normal rendering, ensuring geometric consistency among reconstructed 3D scenes. To improve computational and storage efficiency, we introduce a learnable attribute that enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities, validating the efficacy of our geometry-aware paradigm. Source code and multimodal viewer will be available on GitHub.
☆ State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding
Recently, pre-trained state space models have shown great potential for video classification, which sequentially compresses visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning is proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, the sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, thus limiting the effective propagation of spatial information within a video frame and temporal information between frames in the state compression model and the extraction of discriminative information. To tackle the above issue, we proposed a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatiotemporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate spatial key information within each frame. Besides, an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuning parameters.
☆ DPL: Spatial-Conditioned Diffusion Prototype Enhancement for One-Shot Medical Segmentation
One-shot medical image segmentation faces fundamental challenges in prototype representation due to limited annotated data and significant anatomical variability across patients. Traditional prototype-based methods rely on deterministic averaging of support features, creating brittle representations that fail to capture intra-class diversity essential for robust generalization. This work introduces Diffusion Prototype Learning (DPL), a novel framework that reformulates prototype construction through diffusion-based feature space exploration. DPL models one-shot prototypes as learnable probability distributions, enabling controlled generation of diverse yet semantically coherent prototype variants from minimal labeled data. The framework operates through three core innovations: (1) a diffusion-based prototype enhancement module that transforms single support prototypes into diverse variant sets via forward-reverse diffusion processes, (2) a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics, and (3) a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. DPL ensures training-inference consistency by using the same diffusion enhancement and fusion pipeline in both phases. This process generates enhanced prototypes that serve as the final representations for similarity calculations, while the diffusion process itself acts as a regularizer. Extensive experiments on abdominal MRI and CT datasets demonstrate significant improvements respectively, establishing new state-of-the-art performance in one-shot medical image segmentation.
comment: Accepted at IVCNZ 2025. To be published in IEEE proceedings
☆ Class-aware Domain Knowledge Fusion and Fission for Continual Test-Time Adaptation
Continual Test-Time Adaptation (CTTA) aims to quickly fine-tune the model during the test phase so that it can adapt to multiple unknown downstream domain distributions without pre-acquiring downstream domain data. To this end, existing advanced CTTA methods mainly reduce the catastrophic forgetting of historical knowledge caused by irregular switching of downstream domain data by restoring the initial model or reusing historical models. However, these methods are usually accompanied by serious insufficient learning of new knowledge and interference from potentially harmful historical knowledge, resulting in severe performance degradation. To this end, we propose a class-aware domain Knowledge Fusion and Fission method for continual test-time adaptation, called KFF, which adaptively expands and merges class-aware domain knowledge in old and new domains according to the test-time data from different domains, where discriminative historical knowledge can be dynamically accumulated. Specifically, considering the huge domain gap within streaming data, a domain Knowledge FIssion (KFI) module is designed to adaptively separate new domain knowledge from a paired class-aware domain prompt pool, alleviating the impact of negative knowledge brought by old domains that are distinct from the current domain. Besides, to avoid the cumulative computation and storage overheads from continuously fissioning new knowledge, a domain Knowledge FUsion (KFU) module is further designed to merge the fissioned new knowledge into the existing knowledge pool with minimal cost, where a greedy knowledge dynamic merging strategy is designed to improve the compatibility of new and old knowledge while keeping the computational efficiency. Extensive experiments on the ImageNet-C dataset verify the effectiveness of our proposed method against other methods.
☆ MAPS: Masked Attribution-based Probing of Strategies- A computational framework to align human and model explanations
Human core object recognition depends on the selective use of visual information, but the strategies guiding these choices are difficult to measure directly. We present MAPS (Masked Attribution-based Probing of Strategies), a behaviorally validated computational tool that tests whether explanations derived from artificial neural networks (ANNs) can also explain human vision. MAPS converts attribution maps into explanation-masked images (EMIs) and compares image-by-image human accuracies on these minimal images with limited pixel budgets with accuracies on the full stimuli. MAPS provides a principled way to evaluate and choose among competing ANN interpretability methods. In silico, EMI-based behavioral similarity between models reliably recovers the ground-truth similarity computed from their attribution maps, establishing which explanation methods best capture the model's strategy. When applied to humans and macaques, MAPS identifies ANN-explanation combinations whose explanations align most closely with biological vision, achieving the behavioral validity of Bubble masks while requiring far fewer behavioral trials. Because it needs only access to model attributions and a modest set of behavioral data on the original images, MAPS avoids exhaustive psychophysics while offering a scalable tool for adjudicating explanations and linking human behavior, neural activity, and model decisions under a common standard.
☆ FedHUG: Federated Heterogeneous Unsupervised Generalization for Remote Physiological Measurements
Remote physiological measurement gained wide attention, while it requires collecting users' privacy-sensitive information, and existing contactless measurements still rely on labeled client data. This presents challenges when we want to further update real-world deployed models with numerous user data lacking labels. To resolve these challenges, we instantiate a new protocol called Federated Unsupervised Domain Generalization (FUDG) in this work. Subsequently, the \textbf{Fed}erated \textbf{H}eterogeneous \textbf{U}nsupervised \textbf{G}eneralization (\textbf{FedHUG}) framework is proposed and consists of: (1) Minimal Bias Aggregation module dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features from multiple domains. (2) The Global Distribution-aware Learning Controller parameterizes the label distribution and dynamically manipulates client-specific training strategies, thereby mitigating the server-client label distribution skew and long-tail issue. The proposal shows superior performance across state-of-the-art techniques in estimation with either RGB video or mmWave radar. The code will be released.
☆ MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
☆ Hardware-aware Coding Function Design for Compressive Single-Photon 3D Cameras
Single-photon cameras are becoming increasingly popular in time-of-flight 3D imaging because they can time-tag individual photons with extreme resolution. However, their performance is susceptible to hardware limitations, such as system bandwidth, maximum laser power, sensor data rates, and in-sensor memory and compute resources. Compressive histograms were recently introduced as a solution to the challenge of data rates through an online in-sensor compression of photon timestamp data. Although compressive histograms work within limited in-sensor memory and computational resources, they underperform when subjected to real-world illumination hardware constraints. To address this, we present a constrained optimization approach for designing practical coding functions for compressive single-photon 3D imaging. Using gradient descent, we jointly optimize an illumination and coding matrix (i.e., the coding functions) that adheres to hardware constraints. We show through extensive simulations that our coding functions consistently outperform traditional coding designs under both bandwidth and peak power constraints. This advantage is particularly pronounced in systems constrained by peak power. Finally, we show that our approach adapts to arbitrary parameterized impulse responses by evaluating it on a real-world system with a non-ideal impulse response function.
comment: IEEE TPAMI Special Issue
☆ ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation NeurIPS 2025
The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has raised significant concerns about the unauthorized use of private image datasets. While these systems have shown remarkable capabilities in enhancing generation quality through reference images, protecting visual datasets from unauthorized use in such systems remains a challenging problem. Traditional digital watermarking approaches face limitations in RAIG systems, as the complex feature extraction and recombination processes fail to preserve watermark signals during generation. To address these challenges, we propose ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our framework synthesizes sentinel images that maintain visual consistency with the original dataset. These sentinels enable protection verification through randomly generated character sequences that serve as retrieval keys. To ensure seamless integration, we leverage vision-language models to generate the sentinel images. Experimental results demonstrate that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications. Code is available at https://github.com/luo-ziyuan/ImageSentinel.
comment: Accepted at NeurIPS 2025
☆ Self-Supervised Selective-Guided Diffusion Model for Old-Photo Face Restoration
Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods either rely on explicit degradation priors or global statistical guidance, which struggle with localized artifacts or face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: https://github.com/PRIS-CV/SSDiff.
☆ DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning
With the excellent representation capabilities of Pre-Trained Models (PTMs), remarkable progress has been made in non-rehearsal Class-Incremental Learning (CIL) research. However, it remains an extremely challenging task due to three conundrums: increasingly large model complexity, non-smooth representation shift during incremental learning and inconsistency between stage-wise sub-problem optimization and global inference. In this work, we propose the Discriminative Representation Learning (DRL) framework to specifically address these challenges. To conduct incremental learning effectively and yet efficiently, the DRL's network, called Incremental Parallel Adapter (IPA) network, is built upon a PTM and increasingly augments the model by learning a lightweight adapter with a small amount of parameter learning overhead in each incremental stage. The adapter is responsible for adapting the model to new classes, it can inherit and propagate the representation capability from the current model through parallel connection between them by a transfer gate. As a result, this design guarantees a smooth representation shift between different incremental stages. Furthermore, to alleviate inconsistency and enable comparable feature representations across incremental stages, we design the Decoupled Anchor Supervision (DAS). It decouples constraints of positive and negative samples by respectively comparing them with the virtual anchor. This decoupling promotes discriminative representation learning and aligns the feature spaces learned at different stages, thereby narrowing the gap between stage-wise local optimization over a subset of data and global inference across all classes. Extensive experiments on six benchmarks reveal that our DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period while maintaining high efficiency in both training and inference phases.
comment: 13 pages, 7 figures
Gaussian Semantic Field for One-shot LiDAR Global Localization
We present a one-shot LiDAR global localization algorithm featuring semantic disambiguation ability based on a lightweight tri-layered scene graph. While landmark semantic registration-based methods have shown promising performance improvements in global localization compared with geometric-only methods, landmarks can be repetitive and misleading for correspondence establishment. We propose to mitigate this problem by modeling semantic distributions with continuous functions learned from a population of Gaussian processes. Compared with discrete semantic labels, the continuous functions capture finer-grained geo-semantic information and also provide more detailed metric information for correspondence establishment. We insert this continuous function as the middle layer between the object layer and the metric-semantic layer, forming a tri-layered 3D scene graph, serving as a light-weight yet performant backend for one-shot localization. We term our global localization pipeline Outram-GSF (Gaussian semantic field) and conduct a wide range of experiments on publicly available data sets, validating the superior performance against the current state-of-the-art.
☆ G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior
Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at https://dali-jack.github.io/g4splat-web/.
comment: Project page: https://dali-jack.github.io/g4splat-web/
☆ An Adaptive Edge-Guided Dual-Network Framework for Fast QR Code Motion Deblurring
Unlike general image deblurring that prioritizes perceptual quality, QR code deblurring focuses on ensuring successful decoding. QR codes are characterized by highly structured patterns with sharp edges, a robust prior for restoration. Yet existing deep learning methods rarely exploit these priors explicitly. To address this gap, we propose the Edge-Guided Attention Block (EGAB), which embeds explicit edge priors into a Transformer architecture. Based on EGAB, we develop Edge-Guided Restormer (EG-Restormer), an effective network that significantly boosts the decoding rate of severely blurred QR codes. For mildly blurred inputs, we design the Lightweight and Efficient Network (LENet) for fast deblurring. We further integrate these two networks into an Adaptive Dual-network (ADNet), which dynamically selects the suitable network based on input blur severity, making it ideal for resource-constrained mobile devices. Extensive experiments show that our EG-Restormer and ADNet achieve state-of-the-art performance with a competitive speed. Project page: https://github.com/leejianping/ADNet
☆ IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation
In this study, we present IL3D, a large-scale dataset meticulously designed for large language model (LLM)-driven 3D scene generation, addressing the pressing demand for diverse, high-quality training data in indoor layout design. Comprising 27,816 indoor layouts across 18 prevalent room types and a library of 29,215 high-fidelity 3D object assets, IL3D is enriched with instance-level natural language annotations to support robust multimodal learning for vision-language tasks. We establish rigorous benchmarks to evaluate LLM-driven scene generation. Experimental results show that supervised fine-tuning (SFT) of LLMs on IL3D significantly improves generalization and surpasses the performance of SFT on other datasets. IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks. As a versatile and robust resource, IL3D significantly advances research in 3D scene generation and embodied intelligence, by providing high-fidelity scene data to support environment perception tasks of embodied agents.
comment: 9 pages main paper; 15 pages references and appendix
☆ Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.
☆ A Review on Domain Adaption and Generative Adversarial Networks(GANs)
The major challenge in today's computer vision scenario is the availability of good quality labeled data. In a field of study like image classification, where data is of utmost importance, we need to find more reliable methods which can overcome the scarcity of data to produce results comparable to previous benchmark results. In most cases, obtaining labeled data is very difficult because of the high cost of human labor and in some cases impossible. The purpose of this paper is to discuss Domain Adaptation and various methods to implement it. The main idea is to use a model trained on a particular dataset to predict on data from a different domain of the same kind, for example - a model trained on paintings of airplanes predicting on real images of airplanes
☆ VIDMP3: Video Editing by Representing Motion with Pose and Position Priors
Motion-preserved video editing is crucial for creators, particularly in scenarios that demand flexibility in both the structure and semantics of swapped objects. Despite its potential, this area remains underexplored. Existing diffusion-based editing methods excel in structure-preserving tasks, using dense guidance signals to ensure content integrity. While some recent methods attempt to address structure-variable editing, they often suffer from issues such as temporal inconsistency, subject identity drift, and the need for human intervention. To address these challenges, we introduce VidMP3, a novel approach that leverages pose and position priors to learn a generalized motion representation from source videos. Our method enables the generation of new videos that maintain the original motion while allowing for structural and semantic flexibility. Both qualitative and quantitative evaluations demonstrate the superiority of our approach over existing methods. The code will be made publicly available at https://github.com/sandeep-sm/VidMP3.
☆ Your VAR Model is Secretly an Efficient and Explainable Generative Classifier
Generative classifiers, which leverage conditional generative models for classification, have recently demonstrated desirable properties such as robustness to distribution shifts. However, recent progress in this area has been largely driven by diffusion-based models, whose substantial computational cost severely limits scalability. This exclusive focus on diffusion-based methods has also constrained our understanding of generative classifiers. In this work, we propose a novel generative classifier built on recent advances in visual autoregressive (VAR) modeling, which offers a new perspective for studying generative classifiers. To further enhance its performance, we introduce the Adaptive VAR Classifier$^+$ (A-VARC$^+$), which achieves a superior trade-off between accuracy and inference speed, thereby significantly improving practical applicability. Moreover, we show that the VAR-based method exhibits fundamentally different properties from diffusion-based methods. In particular, due to its tractable likelihood, the VAR-based classifier enables visual explainability via token-wise mutual information and demonstrates inherent resistance to catastrophic forgetting in class-incremental learning tasks.
☆ APGNet: Adaptive Prior-Guided for Underwater Camouflaged Object Detection ACM MM
Detecting camouflaged objects in underwater environments is crucial for marine ecological research and resource exploration. However, existing methods face two key challenges: underwater image degradation, including low contrast and color distortion, and the natural camouflage of marine organisms. Traditional image enhancement techniques struggle to restore critical features in degraded images, while camouflaged object detection (COD) methods developed for terrestrial scenes often fail to adapt to underwater environments due to the lack of consideration for underwater optical characteristics. To address these issues, we propose APGNet, an Adaptive Prior-Guided Network, which integrates a Siamese architecture with a novel prior-guided mechanism to enhance robustness and detection accuracy. First, we employ the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm for data augmentation, generating illumination-invariant images to mitigate degradation effects. Second, we design an Extended Receptive Field (ERF) module combined with a Multi-Scale Progressive Decoder (MPD) to capture multi-scale contextual information and refine feature representations. Furthermore, we propose an adaptive prior-guided mechanism that hierarchically fuses position and boundary priors by embedding spatial attention in high-level features for coarse localization and using deformable convolution to refine contours in low-level features. Extensive experimental results on two public MAS datasets demonstrate that our proposed method APGNet outperforms 15 state-of-art methods under widely used evaluation metrics.
comment: 6 pages. accepted by ACM MM Asia 2025
☆ One Dimensional CNN ECG Mamba for Multilabel Abnormality Classification in 12 Lead ECG
Accurate detection of cardiac abnormalities from electrocardiogram recordings is regarded as essential for clinical diagnostics and decision support. Traditional deep learning models such as residual networks and transformer architectures have been applied successfully to this task, but their performance has been limited when long sequential signals are processed. Recently, state space models have been introduced as an efficient alternative. In this study, a hybrid framework named One Dimensional Convolutional Neural Network Electrocardiogram Mamba is introduced, in which convolutional feature extraction is combined with Mamba, a selective state space model designed for effective sequence modeling. The model is built upon Vision Mamba, a bidirectional variant through which the representation of temporal dependencies in electrocardiogram data is enhanced. Comprehensive experiments on the PhysioNet Computing in Cardiology Challenges of 2020 and 2021 were conducted, and superior performance compared with existing methods was achieved. Specifically, the proposed model achieved substantially higher AUPRC and AUROC scores than those reported by the best previously published algorithms on twelve lead electrocardiograms. These results demonstrate the potential of Mamba-based architectures to advance reliable ECG classification. This capability supports early diagnosis and personalized treatment, while enhancing accessibility in telemedicine and resource-constrained healthcare systems.
comment: 6 Pages, 2 figures
☆ SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion
Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text--motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene--motion and text--motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness to text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
comment: 15 pages
☆ SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models
Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences. To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graphs (DTG)-based automatic evaluation metric, which can efficiently capture long-range dependencies and temporal ordering while maintaining computational efficiency. Our DTG-based metric demonstrates a strong correlation with human annotations. Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models. Please refer to https://videobench.github.io/SeqBench.github.io/ for more details.
☆ SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state of the art vision language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.
☆ UNCAP: Uncertainty-Guided Planning Using Natural Language Communication for Cooperative Autonomous Vehicles
Safe large-scale coordination of multiple cooperative connected autonomous vehicles (CAVs) hinges on communication that is both efficient and interpretable. Existing approaches either rely on transmitting high-bandwidth raw sensor data streams or neglect perception and planning uncertainties inherent in shared data, resulting in systems that are neither scalable nor safe. To address these limitations, we propose Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a vision-language model-based planning approach that enables CAVs to communicate via lightweight natural language messages while explicitly accounting for perception uncertainty in decision-making. UNCAP features a two-stage communication protocol: (i) an ego CAV first identifies the subset of vehicles most relevant for information exchange, and (ii) the selected CAVs then transmit messages that quantitatively express their perception uncertainty. By selectively fusing messages that maximize mutual information, this strategy allows the ego vehicle to integrate only the most relevant signals into its decision-making, improving both the scalability and reliability of cooperative planning. Experiments across diverse driving scenarios show a 63% reduction in communication bandwidth with a 31% increase in driving safety score, a 61% reduction in decision uncertainty, and a four-fold increase in collision distance margin during near-miss events. Project website: https://uncap-project.github.io/
Scope: Selective Cross-modal Orchestration of Visual Perception Experts
Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49\%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
comment: 14 pages, 2 figures
☆ CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models
We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.
comment: 8 pages, 3 figures. Endorsed by Dr. Seyedmorteza Sadat (ETH Zurich). The work introduces CADE 2.5 with ZeResFDG as a practical inference-time guidance stack for SD/SDXL. Code and visual examples to be released on GitHub and Hugging Face
☆ Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
☆ Unifying Vision-Language Latents for Zero-label Image Caption Enhancement NeurIPS 2025
Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.
comment: Accepted to PMLR and NeurIPS 2025 UniReps
☆ Robust Plant Disease Diagnosis with Few Target-Domain Samples
Various deep learning-based systems have been proposed for accurate and convenient plant disease diagnosis, achieving impressive performance. However, recent studies show that these systems often fail to maintain diagnostic accuracy on images captured under different conditions from the training environment -- an essential criterion for model robustness. Many deep learning methods have shown high accuracy in plant disease diagnosis. However, they often struggle to generalize to images taken in conditions that differ from the training setting. This drop in performance stems from the subtle variability of disease symptoms and domain gaps -- differences in image context and environment. The root cause is the limited diversity of training data relative to task complexity, making even advanced models vulnerable in unseen domains. To tackle this challenge, we propose a simple yet highly adaptable learning framework called Target-Aware Metric Learning with Prioritized Sampling (TMPS), grounded in metric learning. TMPS operates under the assumption of access to a limited number of labeled samples from the target (deployment) domain and leverages these samples effectively to improve diagnostic robustness. We assess TMPS on a large-scale automated plant disease diagnostic task using a dataset comprising 223,073 leaf images sourced from 23 agricultural fields, spanning 21 diseases and healthy instances across three crop species. By incorporating just 10 target domain samples per disease into training, TMPS surpasses models trained using the same combined source and target samples, and those fine-tuned with these target samples after pre-training on source data. It achieves average macro F1 score improvements of 7.3 and 3.6 points, respectively, and a remarkable 18.7 and 17.1 point improvement over the baseline and conventional metric learning.
comment: 7 pages, 2 figures. Accepted at the IEEE International Conference on Visual Communications and Image Processing (VCIP) 2025. Extended version
☆ State-Change Learning for Prediction of Future Events in Endoscopic Videos
Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks-enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limits by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, implementing this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.
comment: 24 pages, 13 figures
☆ SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.
comment: Project page: https://research.nvidia.com/labs/sil/projects/simuli
☆ Learning to Grasp Anything by Playing with Random Toys
Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these "toys" enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation. Demonstration videos, code, checkpoints and our dataset are available on our project page: https://lego-grasp.github.io/ .
♻ ☆ Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping
Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured with overall accuracy (OA), average accuracy (AA), and F1-score. HyperSigma achieved an OA of 34.5% (+/- 1.8%), DOFA reached 62.6% (+/- 3.5%), and the SpectralEarth model achieved an OA of 93.5% (+/- 0.8%). A compact SpectralEarth variant trained from scratch achieved 91%, highlighting the importance of model architecture for strong generalization across geographic regions and sensor platforms. These results provide a systematic evaluation of foundation models for operational hyperspectral crop mapping and outline directions for future model development.
comment: currently being reviewed for WHISPERS conference ( Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing )
♻ ☆ Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal \emph{Massive Activations} (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We found that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of output. Building on these insights, we propose \textbf{D}etail \textbf{G}uidance (\textbf{DG}), a MAs-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded ``detail-deficient'' model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinements of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (\eg, SD3, SD3.5, and Flux).
♻ ☆ AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Moreover, utilizing our cache eviction algorithm -- OKV -- along with customized speculative decoding and compression strategies, we achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We release all models on https://huggingface.co/OPPOer.
comment: Tech report of OPPO AndesVL Team
♻ ☆ Towards Robust and Realible Multimodal Misinformation Recognition with Incomplete Modality
Multimodal Misinformation Recognition has become an urgent task with the emergence of huge multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel generic and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning to compensate for missing modalities by dynamically leveraging complementary information through multiple experts. (2) Incomplete Modality Adapters compensates for the missing information by leveraging the new feature distribution. (3) Modality Missing Learning leveraging an label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of misinformation recognition in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.
♻ ☆ REACT3D: Recovering Articulations for Interactive Physical 3D Scenes
Interactive 3D scenes are increasingly vital for embodied intelligence, yet existing datasets remain limited due to the labor-intensive process of annotating part segmentation, kinematic types, and motion trajectories. We present REACT3D, a scalable zero-shot framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry, enabling direct use in diverse downstream tasks. Our contributions include: (i) openable-object detection and segmentation to extract candidate movable parts from static scenes, (ii) articulation estimation that infers joint types and motion parameters, (iii) hidden-geometry completion followed by interactive object assembly, and (iv) interactive scene integration in widely supported formats to ensure compatibility with standard simulation platforms. We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes, demonstrating the effectiveness of our framework and providing a practical foundation for scalable interactive scene generation, thereby lowering the barrier to large-scale research on articulated scene understanding. Our project page is https://react3d.github.io/
comment: 8 pages
♻ ☆ FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models NeurIPS 2025
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model's associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
comment: 19 pages, 11 figures. Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Optimally Deep Networks -- Adapting Model Depth to Datasets for Superior Efficiency
Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.
comment: 6 pages, 3 figures, 1 table
♻ ☆ GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task aware reasoning. We propose a novel post training framework that incorporates task aware rewards to enable effective adaptation of reasoning based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state of the art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/ .
comment: Tables 6 and Figures 8. https://mustansarfiaz.github.io/GeoVLM-R1/
♻ ☆ VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
♻ ☆ Constructing a Real-World Benchmark for Early Wildfire Detection with the New PYRONEAR-2025 Dataset
Early wildfire detection (EWD) is of the utmost importance to enable rapid response efforts, and thus minimize the negative impacts of wildfire spreads. To this end, we present PYRONEAR-2025, a new dataset composed of both images and videos, allowing for the training and evaluation of smoke plume detection models, including sequential models. The data is sourced from: (i) web-scraped videos of wildfires from public networks of cameras for wildfire detection in-the-wild, (ii) videos from our in-house network of cameras, and (iii) a small portion of synthetic and real images. This dataset includes around 150,000 manual annotations on 50,000 images, covering 640 wildfires, PYRONEAR-2025 surpasses existing datasets in size and diversity. It includes data from France, Spain, Chile and the United States. Finally, it is composed of both images and videos, allowing for the training and evaluation of smoke plume detection models, including sequential models. We ran cross-dataset experiments using a lightweight state-of-the-art object detection model, as the ones used in-real-life, and found out the proposed dataset is particularly challenging, with F1 score of around 70\%, but more stable than existing datasets. Finally, its use in concordance with other public datasets helps to reach higher results overall. Last but not least, the video part of the dataset can be used to train a lightweight sequential model, improving global recall while maintaining precision for earlier detections. [We make both our code and data available online](https://github.com/joseg20/wildfires2025).
comment: Preprint of ongoing work
♻ ☆ DarkIR: Robust Low-Light Image Restoration CVPR 2025
Photography during night or in dark conditions typically suffers from noise, low light and blurring issues due to the dim environment and the common use of long exposure. Although Deblurring and Low-light Image Enhancement (LLIE) are related under these conditions, most approaches in image restoration solve these tasks separately. In this paper, we present an efficient and robust neural network for multi-task low-light image restoration. Instead of following the current tendency of Transformer-based models, we propose new attention mechanisms to enhance the receptive field of efficient CNNs. Our method reduces the computational costs in terms of parameters and MAC operations compared to previous methods. Our model, DarkIR, achieves new state-of-the-art results on the popular LOLBlur, LOLv2 and Real-LOLBlur datasets, being able to generalize on real-world night and dark images. Code and models at https://github.com/cidautai/DarkIR
comment: CVPR 2025
♻ ☆ KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging
KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at https://github.com/vboussot/KonfAI.
comment: https://github.com/vboussot/KonfAI
♻ ☆ How to Train Your Metamorphic Deep Neural Network
Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the undelying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: https://github.com/TSommariva/HTTY_NeuMeta.
comment: Accepted with an Honorable Mention Award at ICIAP 2025 (Rome, Italy). 14 pages, 7 figures
♻ ☆ Modular Embedding Recomposition for Incremental Learning BMVC 2025
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
comment: Accepted to the 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK
♻ ☆ SAIP-Net: Enhancing Remote Sensing Image Segmentation via Spectral Adaptive Information Propagation
Semantic segmentation of remote sensing imagery demands precise spatial boundaries and robust intra-class consistency, challenging conventional hierarchical models. To address limitations arising from spatial domain feature fusion and insufficient receptive fields, this paper introduces SAIP-Net, a novel frequency-aware segmentation framework that leverages Spectral Adaptive Information Propagation. SAIP-Net employs adaptive frequency filtering and multi-scale receptive field enhancement to effectively suppress intra-class feature inconsistencies and sharpen boundary lines. Comprehensive experiments demonstrate significant performance improvements over state-of-the-art methods, highlighting the effectiveness of spectral-adaptive strategies combined with expanded receptive fields for remote sensing image segmentation.
♻ ☆ Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning
Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.
comment: 33 pages, 9 figures
♻ ☆ UrbanTwin: Building High-Fidelity Digital Twins for Sim2Real LiDAR Perception and Evaluation
LiDAR-based perception in intelligent transportation systems (ITS) relies on deep neural networks trained with large-scale labeled datasets. However, creating such datasets is expensive, time-consuming, and labor-intensive, limiting the scalability of perception systems. Sim2Real learning offers a scalable alternative, but its success depends on the simulation's fidelity to real-world environments, dynamics, and sensors. This tutorial introduces a reproducible workflow for building high-fidelity digital twins (HiFi DTs) to generate realistic synthetic datasets. We outline practical steps for modeling static geometry, road infrastructure, and dynamic traffic using open-source resources such as satellite imagery, OpenStreetMap, and sensor specifications. The resulting environments support scalable and cost-effective data generation for robust Sim2Real learning. Using this workflow, we have released three synthetic LiDAR datasets, namely UT-LUMPI, UT-V2X-Real, and UT-TUMTraf-I, which closely replicate real locations and outperform real-data-trained baselines in perception tasks. This guide enables broader adoption of HiFi DTs in ITS research and deployment.
♻ ☆ J-RAS: Enhancing Medical Image Segmentation via Retrieval-Augmented Joint Training
Image segmentation, the process of dividing images into meaningful regions, is critical in medical applications for accurate diagnosis, treatment planning, and disease monitoring. Although manual segmentation by healthcare professionals produces precise outcomes, it is time-consuming, costly, and prone to variability due to differences in human expertise. Artificial intelligence (AI)-based methods have been developed to address these limitations by automating segmentation tasks; however, they often require large, annotated datasets that are rarely available in practice and frequently struggle to generalize across diverse imaging conditions due to inter-patient variability and rare pathological cases. In this paper, we propose Joint Retrieval Augmented Segmentation (J-RAS), a joint training method for guided image segmentation that integrates a segmentation model with a retrieval model. Both models are jointly optimized, enabling the segmentation model to leverage retrieved image-mask pairs to enrich its anatomical understanding, while the retrieval model learns segmentation-relevant features beyond simple visual similarity. This joint optimization ensures that retrieval actively contributes meaningful contextual cues to guide boundary delineation, thereby enhancing the overall segmentation performance. We validate J-RAS across multiple segmentation backbones, including U-Net, TransUNet, SAM, and SegFormer, on two benchmark datasets: ACDC and M&Ms, demonstrating consistent improvements. For example, on the ACDC dataset, SegFormer without J-RAS achieves a mean Dice score of 0.8708$\pm$0.042 and a mean Hausdorff Distance (HD) of 1.8130$\pm$2.49, whereas with J-RAS, the performance improves substantially to a mean Dice score of 0.9115$\pm$0.031 and a mean HD of 1.1489$\pm$0.30. These results highlight the method's effectiveness and its generalizability across architectures and datasets.
♻ ☆ TTT3R: 3D Reconstruction as Test-Time Training
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available in https://rover-xingyu.github.io/TTT3R
comment: Page: https://rover-xingyu.github.io/TTT3R/ Code: https://github.com/Inception3D/TTT3R
♻ ☆ Denoised Diffusion for Object-Focused Image Augmentation
Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
♻ ☆ Image Quality Assessment for Embodied AI
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the Real-world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) based on the Mertonian system and meta-cognitive theory, constructed a perception-cognition-decision-execution pipeline and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5m fine-grained annotations provided by Vision Language Models/Vision Language Action-models/Real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex distortions in the Real-world. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA
♻ ☆ TreeDiffusion: Hierarchical Generative Clustering for Conditional Diffusion ECML
Generative modeling and clustering are conventionally distinct tasks in machine learning. Variational Autoencoders (VAEs) have been widely explored for their ability to integrate both, providing a framework for generative clustering. However, while VAEs can learn meaningful cluster representations in latent space, they often struggle to generate high-quality samples. This paper addresses this problem by introducing TreeDiffusion, a deep generative model that conditions diffusion models on learned latent hierarchical cluster representations from a VAE to obtain high-quality, cluster-specific generations. Our approach consists of two steps: first, a VAE-based clustering model learns a hierarchical latent representation of the data. Second, a cluster-aware diffusion model generates realistic images conditioned on the learned hierarchical structure. We systematically compare the generative capabilities of our approach with those of alternative conditioning strategies. Empirically, we demonstrate that conditioning diffusion models on hierarchical cluster representations improves the generative performance on real-world datasets compared to other approaches. Moreover, a key strength of our method lies in its ability to generate images that are both representative and specific to each cluster, enabling more detailed visualization of the learned latent structure. Our approach addresses the generative limitations of VAE-based clustering approaches by leveraging their learned structure, thereby advancing the field of generative clustering.
comment: 31 pages, accepted to ECML PKDD 2025
♻ ☆ Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification (MIDOG 2025 Task 2 Winner)
Atypical mitotic figures (AMFs) represent abnormal cell division associated with poor prognosis. Yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we fine-tuned the recently published DINOv3-H+ vision transformer, pretrained on natural images, using low-rank adaptation (LoRA), training only ~1.3M parameters in combination with extensive augmentation and a domain-weighted Focal Loss to handle domain heterogeneity. Despite the domain gap, our fine-tuned DINOv3 transfers effectively to histopathology, reaching first place on the final test set. These results highlight the advantages of DINOv3 pretraining and underline the efficiency and robustness of our fine-tuning strategy, yielding state-of-the-art results for the atypical mitosis classification challenge in MIDOG 2025.
comment: 4 pages. Challenge report for MIDOG 2025 (Task 2: Atypical Mitotic Figure Classification)
♻ ☆ Enhancing Representations through Heterogeneous Self-Supervised Learning
Incorporating heterogeneous representations from different architectures has facilitated various vision tasks, e.g., some hybrid networks combine transformers and convolutions. However, complementarity between such heterogeneous architectures has not been well exploited in self-supervised learning. Thus, we propose Heterogeneous Self-Supervised Learning (HSSL), which enforces a base model to learn from an auxiliary head whose architecture is heterogeneous from the base model. In this process, HSSL endows the base model with new characteristics in a representation learning way without structural changes. To comprehensively understand the HSSL, we conduct experiments on various heterogeneous pairs containing a base model and an auxiliary head. We discover that the representation quality of the base model moves up as their architecture discrepancy grows. This observation motivates us to propose a search strategy that quickly determines the most suitable auxiliary head for a specific base model to learn and several simple but effective methods to enlarge the model discrepancy. The HSSL is compatible with various self-supervised methods, achieving superior performances on various downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection. The codes are available at https://github.com/NK-JittorCV/Self-Supervised/.
♻ ☆ Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications NeurIPS 2025
The practical application of AI tools for specific computer vision tasks relies on the "small-data regime" of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics or industrial quality control. We find, however, that computer vision research has ignored the small data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multi-modal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.
comment: 5 pages (main text), 3 figures. Accepted at the Imageomics Workshop at NeurIPS 2025
♻ ☆ FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions NeurIPS 2025
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
comment: Project homepage: https://flageval-baai.github.io/LRM-Eval/ This work will also be presented at NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models (FoRLM)
♻ ☆ Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scene, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination. The extensive experiments on 17 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
♻ ☆ OmniLens: Towards Universal Lens Aberration Correction via LensLib-to-Specific Domain Adaptation
Emerging universal Computational Aberration Correction (CAC) paradigms provide an inspiring solution to light-weight and high-quality imaging with a universal model trained on a lens library (LensLib) to address arbitrary lens aberrations blindly. However, the limited coverage of existing LensLibs leads to poor generalization of the trained models to unseen lenses, whose fine-tuning pipeline is also confined to the lens-descriptions-known case. In this work, we introduce OmniLens, a flexible solution to universal CAC via (i) establishing a convincing LensLib with comprehensive coverage for pre-training a robust base model, and (ii) adapting the model to any specific lens designs with unknown lens descriptions via fast LensLib-to-specific domain adaptation. To achieve these, an Evolution-based Automatic Optical Design (EAOD) pipeline is proposed to generate a rich variety of lens samples with realistic aberration behaviors. Then, we design an unsupervised regularization term for efficient domain adaptation on a few easily accessible real-captured images based on the statistical observation of dark channel priors in degradation induced by lens aberrations. Extensive experiments demonstrate that the LensLib generated by EAOD effectively develops a universal CAC model with strong generalization capabilities, which can also improve the non-blind lens-specific methods by 0.35-1.81dB in PSNR. Additionally, the proposed domain adaptation method significantly improves the base model, especially in severe aberration cases (at most 2.59dB in PSNR). The code and data will be available at https://github.com/zju-jiangqi/OmniLens.
comment: The code and data will be available at https://github.com/zju-jiangqi/OmniLens
♻ ☆ Capturing More: Learning Multi-Domain Representations for Robust Online Handwriting Verification ACM MM 2025
In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM's superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.
comment: Accepted to ACM MM 2025
♻ ☆ Probabilistic Temporal Masked Attention for Cross-view Online Action Detection
As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. The sensitivity of mainstream OAD models to varying video viewpoints often hampers their generalization when confronted with unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell, which leverages these representations to effectively query the input video sequence, thereby enhancing information interaction and facilitating autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. Experiments conducted under three evaluation protocols: cross-subject (cs), cross-view (cv), and cross-subject-view (csv) show that PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.
comment: 12 pages, 6 figures, accepted at IEEE Transactions on Multimedia (TMM), in press
♻ ☆ OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations NeurIPS 2025
3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, at present the evaluation of these representations is limited to datasets with closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark for evaluating 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. Our label sets provide 13 times more labels per scene than the original datasets. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. Our experiments provide insights on feature precision, segmentation, and downstream capabilities. The benchmark is publicly available at: https://openlex3d.github.io/.
comment: NeurIPS 2025
♻ ☆ VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning NeurIPS 2025
Reinforcement fine-tuning (RFT) has shown great promise in achieving humanlevel reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e.VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.
comment: Accepted by NeurIPS 2025. Code: https://github.com/QiWang98/VideoRFT
♻ ☆ BAAF: A benchmark attention adaptive framework for medical ultrasound image segmentation tasks
The AI-based assisted diagnosis programs have been widely investigated on medical ultrasound images. Complex scenario of ultrasound image, in which the coupled interference of internal and external factors is severe, brings a unique challenge for localize the object region automatically and precisely in ultrasound images. In this study, we seek to propose a more general and robust Benchmark Attention Adaptive Framework (BAAF) to assist doctors segment or diagnose lesions and tissues in ultrasound images more quickly and accurately. Different from existing attention schemes, the BAAF consists of a parallel hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM). Specifically, BAAF first coarsely calibrates the input features from the channel and spatial dimensions, and then adaptively selects more robust lesion or tissue characterizations from the coarse-calibrated feature maps. The design of BAAF further optimizes the "what" and "where" focus and selection problems in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in medical ultrasound images. The method is evaluated on four medical ultrasound segmentation tasks, and the adequate experimental results demonstrate the remarkable performance improvement over existing state-of-the-art methods. In addition, the comparison with existing attention mechanisms also demonstrates the superiority of BAAF. This work provides the possibility for automated medical ultrasound assisted diagnosis and reduces reliance on human accuracy and precision.
comment: 10 pages, 11 figures
♻ ☆ StegOT: Trade-offs in Steganography via Optimal Transport ICME 2025
Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on genera-tive adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.
comment: Accepted by IEEE International Conference on Multimedia and Expo (ICME 2025)
♻ ☆ Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs.This classification is based on their respective capabilities and functionalities in processing and generating various modalities of data.We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We also analyzed the performance of VLMs in various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.
comment: One of the first survey on Visual Language Models
♻ ☆ Online Topological Localization for Navigation Assistance in Bronchoscopy
Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need of patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.
♻ ☆ Logarithmic Mathematical Morphology: theory and applications
In Mathematical Morphology for grey-level functions, an image is analysed by another image named the structuring function. This structuring function is translated over the image domain and summed to the image. However, in an image presenting lighting variations, the amplitude of the structuring function should vary according to the image intensity. Such a property is not verified in Mathematical Morphology for grey level functions, when the structuring function is summed to the image with the usual additive law. In order to address this issue, a new framework is defined with an additive law for which the amplitude of the structuring function varies according to the image amplitude. This additive law is chosen within the Logarithmic Image Processing framework and models the lighting variations with a physical cause such as a change of light intensity. The new framework is named Logarithmic Mathematical Morphology (LMM) and allows the definition of operators which are robust to such lighting variations.
♻ ☆ Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue Boundaries
Real-time acquisition of accurate scene depth is essential for automated robotic minimally invasive surgery. Stereo matching with binocular endoscopy can provide this depth information. However, existing stereo matching methods, designed primarily for natural images, often struggle with endoscopic images due to fuzzy tissue boundaries and typically fail to meet real-time requirements for high-resolution endoscopic image inputs. To address these challenges, we propose \textbf{RRESM}, a real-time stereo matching method tailored for endoscopic images. Our approach integrates a 3D Mamba Coordinate Attention module that enhances cost aggregation through position-sensitive attention maps and long-range spatial dependency modeling via the Mamba block, generating a robust cost volume without substantial computational overhead. Additionally, we introduce a High-Frequency Disparity Optimization module that refines disparity predictions near tissue boundaries by amplifying high-frequency details in the wavelet domain. Evaluations on the SCARED and SERV-CT datasets demonstrate state-of-the-art matching accuracy with a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/RRESM.
♻ ☆ Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results MICCAI 2024
Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS), which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.
comment: This challenge was hosted in MICCAI 2024
♻ ☆ NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
comment: https://github.com/dunnolab/NinA/
♻ ☆ UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 18 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation.
♻ ☆ Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization
High-fidelity 3D scanning is essential for preserving cultural heritage artefacts, supporting documentation, analysis, and long-term conservation. However, conventional methods typically require specialized expertise and manual intervention to maintain optimal scanning conditions and coverage. We present an automated two-robot scanning system that eliminates the need for handheld or semi-automatic workflows by combining coordinated robotic manipulation with high-resolution 3D scanning. Our system parameterizes the scanning space into distinct regions, enabling coordinated motion planning between a scanner-equipped robot and a tray-handling robot. Optimized trajectory planning and waypoint distribution ensure comprehensive surface coverage, minimize occlusions, and balance reconstruction accuracy with system efficiency. Experimental results show that our approach achieves significantly lower Chamfer Distance and higher F-score compared to baseline methods, offering superior geometric accuracy, improved digitization efficiency, and reduced reliance on expert operators.
comment: The author has decided to withdraw this version to verify and update authorization details for certain image materials obtained from a collaborating institution. The issue is administrative and does not affect the technical content of the work. A revised version will be submitted once the verification process is complete
♻ ☆ OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
comment: 30 pages, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. Project Page: https://rbler1234.github.io/OSTBench.github.io/
♻ ☆ In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting NeurIPS 2025
The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in an unified coordinate, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings. Project page: https://taiyi98.github.io/projects/EgoGazeVQA
comment: Accepted to NeurIPS 2025
♻ ☆ CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding
Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present \textbf{CoRGI}(\textbf{C}hain \textbf{o}f \textbf{R}easoning with \textbf{G}rounded \textbf{I}nsights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmark-VCR, ScienceQA, MMMU, MathVista, and HallusionBenc-demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.
comment: The paper is not yet mature and needs further improvement
♻ ☆ Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
comment: Code: https://github.com/VisionXLab/AdapTok
♻ ☆ SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models
Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. In scalable applications, fine-tuning-based methods are time-consuming to precisely erase multiple target concepts, while real-time editing-based methods often degrade the generation quality of non-target concepts due to conflicting optimization objectives. To address this dilemma, we introduce SPEED, an efficient concept erasure approach that directly edits model parameters. SPEED searches for a null space, a model editing space where parameter updates do not affect non-target concepts, to achieve scalable and precise erasure. To facilitate accurate null space optimization, we incorporate three complementary strategies: Influence-based Prior Filtering (IPF) to selectively retain the most affected non-target concepts, Directed Prior Augmentation (DPA) to enrich the filtered retain set with semantically consistent variations, and Invariant Equality Constraints (IEC) to preserve key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure, successfully erasing 100 concepts within only 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.
comment: This version has been temporarily withdrawn for procedural review purposes. The withdrawal is unrelated to the technical content of the paper
♻ ☆ GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset NeurIPS 2025
Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agriculture parcel extraction studies only focus on mid-resolution mapping or regular plain farmlands while lacking representation of complex terraced terrains due to the demands of precision agriculture.In this paper, we introduce a more fine-grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine-grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manual annotation. GTPBD comprises 47,537 high-resolution images with three-level labels, including pixel-level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the world.Compared to the existing datasets, the GTPBD dataset brings considerable challenges due to the: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction, and unsupervised domain adaptation (UDA) tasks.Accordingly, we benchmark the GTPBD dataset on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods, and five UDA methods, along with a multi-dimensional evaluation framework integrating pixel-level and object-level metrics. GTPBD fills a critical gap in terraced remote sensing research, providing a basic infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer.
comment: 38 pages, 18 figures, submitted to NeurIPS 2025
♻ ☆ Tracing Back the Malicious Clients in Poisoning Attacks to Federated Learning NeurIPS
Poisoning attacks compromise the training phase of federated learning (FL) such that the learned global model misclassifies attacker-chosen inputs called target inputs. Existing defenses mainly focus on protecting the training phase of FL such that the learnt global model is poison free. However, these defenses often achieve limited effectiveness when the clients' local training data is highly non-iid or the number of malicious clients is large, as confirmed in our experiments. In this work, we propose FLForensics, the first poison-forensics method for FL. FLForensics complements existing training-phase defenses. In particular, when training-phase defenses fail and a poisoned global model is deployed, FLForensics aims to trace back the malicious clients that performed the poisoning attack after a misclassified target input is identified. We theoretically show that FLForensics can accurately distinguish between benign and malicious clients under a formal definition of poisoning attack. Moreover, we empirically show the effectiveness of FLForensics at tracing back both existing and adaptive poisoning attacks on five benchmark datasets.
comment: Conference on Neural Information Processing Systems (NeurIPS) 2025
♻ ☆ Funny-Valen-Tine: Planning Solution Distribution Enhances Machine Abstract Reasoning Ability
Visual abstract reasoning is core to image processing. We present Valen, a unified probability-highlighting baseline that excels on both RPM (progression) and Bongard-Logo (clustering) tasks. Analysing its internals, we find solvers implicitly treat each task as a distribution where primary samples fit and auxiliaries do not; hence the learning target is jointly shaped by both sets, not by correct solutions alone. To close the gap we first introduce Tine, an adversarial adapter that nudges Valen toward correct-solution density, but adversarial training is unstable. We therefore replace it with Funny, a fast Gaussian-mixture model that directly estimates the correct-solution density without adversarial games, and extend the same paradigm to SBR for progressive-pattern planning. Extensive experiments show explicit distribution planning is the key to stronger, interpretable abstract reasoning. Codes are available in: https://github.com/Yuanbeiming/Funny-Valen-Tine-Planning-Solution-Distribution-Enhances-Machine-Abstract-Reasoning-Ability
comment: 14 pages, 20 figures, 3 tables
♻ ☆ Levarging Learning Bias for Noisy Anomaly Detection
This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework's contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
♻ ☆ Contrast Sensitivity in Multimodal Large Language Models: A Psychophysics-Inspired Evaluation
Understanding how Multimodal Large Language Models (MLLMs) process low-level visual features is critical for evaluating their perceptual abilities and has not been systematically characterized. Inspired by human psychophysics, we introduce a behavioural method for estimating the Contrast Sensitivity Function (CSF) in MLLMs by treating them as end-to-end observers. Models are queried with structured prompts while viewing noise-based stimuli filtered at specific spatial frequencies. Psychometric functions are derived from the binary verbal responses, and contrast thresholds (and CSFs) are obtained without relying on internal activations or classifier-based proxies. Our results reveal that some models resemble human CSFs in shape or scale, but none capture both. We also find that CSF estimates are highly sensitive to prompt phrasing, indicating limited linguistic robustness. Finally, we show that CSFs predict model performance under frequency-filtered and adversarial conditions. These findings highlight systematic differences in frequency tuning across MLLMs and establish CSF estimation as a scalable diagnostic tool for multimodal perception.
♻ ☆ CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving WACV 2025
Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purpose.
comment: WACV 2025, Project Page: https://turingmotors.github.io/covla-ad/
♻ ☆ Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation
Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that the autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframes planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are in our project page.
♻ ☆ AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.
comment: 15pages, conference
♻ ☆ Extremely low-bitrate Image Compression Semantically Disentangled by LMMs from a Human Perception Perspective
It remains a significant challenge to compress images at extremely low bitrate while achieving both semantic consistency and high perceptual quality. Inspired by human progressive perception mechanism, we propose a Semantically Disentangled Image Compression framework (SEDIC) in this paper. Initially, an extremely compressed reference image is obtained through a learned image encoder. Then we leverage LMMs to extract essential semantic components, including overall descriptions, object detailed description, and semantic segmentation masks. We propose a training-free Object Restoration model with Attention Guidance (ORAG) built on pre-trained ControlNet to restore object details conditioned by object-level text descriptions and semantic masks. Based on the proposed ORAG, we design a multistage semantic image decoder to progressively restore the details object by object, starting from the extremely compressed reference image, ultimately generating high-quality and high-fidelity reconstructions. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at extremely low-bitrates ($\le$ 0.05 bpp).
♻ ☆ mmWave Radar-Based Non-Line-of-Sight Pedestrian Localization at T-Junctions Utilizing Road Layout Extraction via Camera
Pedestrians Localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through road layout inferred from camera for localization of NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.
♻ ☆ GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization NeurIPS 2025
Worldwide image geolocalization-the task of predicting GPS coordinates from images taken anywhere on Earth-poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.
comment: NeurIPS 2025
♻ ☆ HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation ICCV 2025
In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that, compared to existing state-of-the-art (SOTA) methods on the BOP website, the proposed approach outperforms across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.
comment: International Conference on Computer Vision, ICCV 2025 (Highlight) https://iccv.thecvf.com/virtual/2025/poster/338
♻ ☆ CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy
Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from Cryo-EM noisy images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.
♻ ☆ Finding Dori: Memorization in Text-to-Image Diffusion Models Is Not Local
Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such defenses. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the nature of memorization in text-to-image DMs and inform the development of more reliable mitigations against DM memorization.
♻ ☆ EgoBrain: Synergizing Minds and Eyes For Human Action Understanding
The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models have brought new possibilities that have never been imagined before. Here, we present EgoBrain --the world's first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a muiltimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interface with multiple modalities. All data, tools, and acquisition protocols are openly shared to foster open science in cognitive computing.
comment: 22 pages, 12 figures
♻ ☆ STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.
comment: Project Page: https://turingmotors.github.io/stride-qa/
♻ ☆ Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation WACV
The rise of the generative models quality during the past years enabled the generation of edited variations of images at an important scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is successfully handled for simple 3D rendered images, it struggles on real-world images. The reason is twofold: the training data-scarcity, and the difficulty to capture fine-grained differences between complex images. To address those issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show it outperforms two-streams approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high quality data, leading to a challenging new dataset well-suited for IDC named Syned1.
comment: This paper has been accepted for the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025; Code released at https://github.com/gautierevn/BLIP2IDC
♻ ☆ Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval NeurIPS 2025
While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.
comment: NeurIPS 2025; 27 pages, 6 figures
♻ ☆ InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce \textbf{InternScenes}, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.
♻ ☆ Boosting Generic Semi-Supervised Medical Image Segmentation via Diverse Teaching and Label Propagation
Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, the error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data and increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boosting the Generic Semi-Supervised Medical Image Segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouple the training process with labeled and unlabeled data, The second teacher is momentum-updated periodically, thus generating reliable yet divers pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn the global and local knowledge. In addition, to further capture the voxel-level correlations, we propose label propagation to enhance the model robust. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.
comment: Under Review
♻ ☆ Prompt-guided Representation Disentanglement for Action Recognition
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found in https://github.com/iamsnaping/ProDA.git
♻ ☆ IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/L-O-I/IWR-Bench.
♻ ☆ Uncertainty-Supervised Interpretable and Robust Evidential Segmentation
Uncertainty estimation has been widely studied in medical image segmentation as a tool to provide reliability, particularly in deep learning approaches. However, previous methods generally lack effective supervision in uncertainty estimation, leading to low interpretability and robustness of the predictions. In this work, we propose a self-supervised approach to guide the learning of uncertainty. Specifically, we introduce three principles about the relationships between the uncertainty and the image gradients around boundaries and noise. Based on these principles, two uncertainty supervision losses are designed. These losses enhance the alignment between model predictions and human interpretation. Accordingly, we introduce novel quantitative metrics for evaluating the interpretability and robustness of uncertainty. Experimental results demonstrate that compared to state-of-the-art approaches, the proposed method can achieve competitive segmentation performance and superior results in out-of-distribution (OOD) scenarios while significantly improving the interpretability and robustness of uncertainty estimation. Code is available via https://github.com/suiannaius/SURE.
♻ ☆ GarmageNet: A Multimodal Generative Framework for Sewing Pattern Design and Generic Garment Modeling
Realistic digital garment modeling remains a labor-intensive task due to the intricate process of translating 2D sewing patterns into high-fidelity, simulation-ready 3D garments. We introduce GarmageNet, a unified generative framework that automates the creation of 2D sewing patterns, the construction of sewing relationships, and the synthesis of 3D garment initializations compatible with physics-based simulation. Central to our approach is Garmage, a novel garment representation that encodes each panel as a structured geometry image, effectively bridging the semantic and geometric gap between 2D structural patterns and 3D garment geometries. Followed by GarmageNet, a latent diffusion transformer to synthesize panel-wise geometry images and GarmageJigsaw, a neural module for predicting point-to-point sewing connections along panel contours. To support training and evaluation, we build GarmageSet, a large-scale dataset comprising 14,801 professionally designed garments with detailed structural and style annotations. Our method demonstrates versatility and efficacy across multiple application scenarios, including scalable garment generation from multi-modal design concepts (text prompts, sketches, photographs), automatic modeling from raw flat sewing patterns, pattern recovery from unstructured point clouds, and progressive garment editing using conventional instructions, laying the foundation for fully automated, production-ready pipelines in digital fashion. Project page: https://style3d.github.io/garmagenet/.
comment: 23 pages,20 figures
♻ ☆ EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning
Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for enhancing vision-language navigation (VLN) performance, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches predominantly adopt straightforward input-output mapping paradigms, causing the mapping learning difficult and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes the perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. To address these issues, we propose EvolveNav, a novel sElf-improving embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning for boosting LLM-based vision-language Navigation. Specifically, EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate the model's navigational reasoning capabilities, and simultaneously increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also designed to encourage the model to learn correct reasoning patterns by contrasting with wrong ones. Experimental results under both task-specific and cross-task training paradigms demonstrate the consistent superiority of EvolveNav over previous LLM-based VLN approaches on various popular benchmarks, including R2R, REVERIE, CVDN, and SOON. Code is available at https://github.com/expectorlin/EvolveNav.
♻ ☆ DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding
Effective scene representation is critical for the visual grounding ability of representations, yet existing methods for 3D Visual Grounding are often constrained. They either only focus on geometric and visual cues, or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM) framework, a novel scene representation framework that enriches robust geometric models with a spectrum of VLM-derived semantics, including appearance, physical properties, and affordances. The DSM is first constructed online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantic-rich map, markedly improving accuracy and interpretability. Extensive evaluations validate our approach's superiority. On the ScanRefer benchmark, DSM-Grounding achieves a state-of-the-art 59.06% overall accuracy of IoU@0.5, surpassing others by 10%. In semantic segmentation, our DSM attains a 67.93% F-mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.
comment: 8 pages, 6 figures, Project Page: https://binicey.github.io/DSM
♻ ☆ FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction
Per- and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are unfortunately persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies, robustness analysis, real-world validation, and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results and expert feedback highlight our framework's potential for scalable PFAS monitoring.
♻ ☆ Backdoor Mitigation via Invertible Pruning Masks
Model pruning has gained traction as a promising defense strategy against backdoor attacks in deep learning. However, existing pruning-based approaches often fall short in accurately identifying and removing the specific parameters responsible for inducing backdoor behaviors. Despite the dominance of fine-tuning-based defenses in recent literature, largely due to their superior performance, pruning remains a compelling alternative, offering greater interpretability and improved robustness in low-data regimes. In this paper, we propose a novel pruning approach featuring a learned \emph{selection} mechanism to identify parameters critical to both main and backdoor tasks, along with an \emph{invertible} pruning mask designed to simultaneously achieve two complementary goals: eliminating the backdoor task while preserving it through the inverse mask. We formulate this as a bi-level optimization problem that jointly learns selection variables, a sparse invertible mask, and sample-specific backdoor perturbations derived from clean data. The inner problem synthesizes candidate triggers using the inverse mask, while the outer problem refines the mask to suppress backdoor behavior without impairing clean-task accuracy. Extensive experiments demonstrate that our approach outperforms existing pruning-based backdoor mitigation approaches, maintains strong performance under limited data conditions, and achieves competitive results compared to state-of-the-art fine-tuning approaches. Notably, the proposed approach is particularly effective in restoring correct predictions for compromised samples after successful backdoor mitigation.
♻ ☆ Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frame contexts with frame-wise importance so that more frames can be encoded within a fixed context length, with more important frames having longer contexts. The frame importance can be measured using time proximity, feature similarity, or hybrid metrics. The packing method allows for inference with thousands of frames and training with relatively large batch sizes. We also present drift prevention methods to address observation bias (error accumulation), including early-established endpoints, adjusted sampling orders, and discrete history representation. Ablation studies validate the effectiveness of the anti-drifting methods in both single-directional video streaming and bi-directional video generation. Finally, we show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules.
comment: https://github.com/lllyasviel/FramePack
♻ ☆ Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition
Handwritten Text Recognition (HTR) is crucial for document digitization, but handwritten data can contain user-identifiable features, like unique writing styles, posing privacy risks. Regulations such as the ``right to be forgotten'' require models to remove these sensitive traces without full retraining. We introduce a practical encoder-only transformer baseline as a robust reference for future HTR research. Building on this, we propose a two-stage unlearning framework for multihead transformer HTR models. Our method combines neural pruning with machine unlearning applied to a writer classification head, ensuring sensitive information is removed while preserving the recognition head. We also present Writer-ID Confusion (WIC), a method that forces the forget set to follow a uniform distribution over writer identities, unlearning user-specific cues while maintaining text recognition performance. We compare WIC to Random Labeling, Fisher Forgetting, Amnesiac Unlearning, and DELETE within our prune-unlearn pipeline and consistently achieve better privacy and accuracy trade-offs. This is the first systematic study of machine unlearning for HTR. Using metrics such as Accuracy, Character Error Rate (CER), Word Error Rate (WER), and Membership Inference Attacks (MIA) on the IAM and CVL datasets, we demonstrate that our method achieves state-of-the-art or superior performance for effective unlearning. These experiments show that our approach effectively safeguards privacy without compromising accuracy, opening new directions for document analysis research. Our code is publicly available at https://github.com/leitro/WIC-WriterIDConfusion-MachineUnlearning.
comment: Accepted for publication in Pattern Recognition, DOI: https://doi.org/10.1016/j.patcog.2025.112411
♻ ☆ Streaming Neural Images ICIP
Implicit Neural Representations (INRs) are a novel paradigm for signal representation that have attracted considerable interest for image compression. INRs offer unprecedented advantages in signal resolution and memory efficiency, enabling new possibilities for compression techniques. However, the existing limitations of INRs for image compression have not been sufficiently addressed in the literature. In this work, we explore the critical yet overlooked limiting factors of INRs, such as computational cost, unstable performance, and robustness. Through extensive experiments and empirical analysis, we provide a deeper and more nuanced understanding of implicit neural image compression methods such as Fourier Feature Networks and Siren. Our work also offers valuable insights for future research in this area.
comment: IEEE International Conference on Image Processing (ICIP) 2024 - Highlight
♻ ☆ Extreme Compression of Adaptive Neural Images ICCV 2025
Implicit Neural Representations (INRs) and Neural Fields are a novel paradigm for signal representation, from images and audio to 3D scenes and videos. The fundamental idea is to represent a signal as a continuous and differentiable neural network. This new approach poses new theoretical questions and challenges. Considering a neural image as a 2D image represented as a neural network, we aim to explore novel neural image compression. In this work, we present a novel analysis on compressing neural fields, with focus on images and introduce Adaptive Neural Images (ANI), an efficient neural representation that enables adaptation to different inference or transmission requirements. Our proposed method allows us to reduce the bits-per-pixel (bpp) of the neural image by 8 times, without losing sensitive details or harming fidelity. Our work offers a new framework for developing compressed neural fields. We achieve a new state-of-the-art in terms of PSNR/bpp trade-off thanks to our successful implementation of 4-bit neural representations.
comment: ICCV 2025 Workshop - Binary and Extreme Quantization for Computer Vision
♻ ☆ CAGE: Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction
We present CAGE (Continuity-Aware edGE) network, a robust framework for reconstructing vector floorplans directly from point-cloud density maps. Traditional corner-based polygon representations are highly sensitive to noise and incomplete observations, often resulting in fragmented or implausible layouts.Recent line grouping methods leverage structural cues to improve robustness but still struggle to recover fine geometric details. To address these limitations,we propose a native edge-centric formulation, modeling each wall segment as a directed, geometrically continuous edge. This representation enables inference of coherent floorplan structures, ensuring watertight, topologically valid room boundaries while improving robustness and reducing artifacts. Towards this design, we develop a dual-query transformer decoder that integrates perturbed and latent queries within a denoising framework, which not only stabilizes optimization but also accelerates convergence. Extensive experiments on Structured3D and SceneCAD show that CAGE achieves state-of-the-art performance, with F1 scores of 99.1% (rooms), 91.7% (corners), and 89.3% (angles). The method also demonstrates strong cross-dataset generalization, underscoring the efficacy of our architectural innovations. Code and pretrained models are available on our project page: https://github.com/ee-Liu/CAGE.git.
♻ ☆ Joint Denoising of Cryo-EM Projection Images using Polar Transformers
Many imaging modalities involve reconstruction of unknown objects from collections of noisy projections related by random rotations. In one of these modalities, cryogenic electron microscopy (cryo-EM), the extremely low signal-to-noise ratio (SNR) makes integration of information from multiple images crucial. Existing approaches to cryo-EM processing, however, either rely on handcrafted priors or apply deep learning only on select portions of the pipeline, such as particle picking, micrograph denoising, or refinement. A fully end-to-end reconstruction approach requires a neural network architecture that integrates information from multiple images while respecting the rotational symmetry of the measurement process. In this work, we introduce the polar transformer, a new neural network architecture that combines polar representations and transformers along with a convolutional attention mechanism that preserves the rotational symmetry of the problem. We apply it to the particle-level denoising problem, where it is able to learn discriminative features in the images, enabling optimal clustering, alignment, and denoising. On simulated datasets, this achieves up to a $2\times$ reduction in mean squared error (MSE) at a signal-to-noise ratio (SNR) of $0.02$, suggesting new opportunities for data-driven approaches to reconstruction in cryo-EM and related tomographic modalities.
♻ ☆ Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning
The Privacy-sensitive Object Identification (POI) task allocates bounding boxes for privacy-sensitive objects in a scene. The key to POI is settling an object's privacy class (privacy-sensitive or non-privacy-sensitive). In contrast to conventional object classes which are determined by the visual appearance of an object, one object's privacy class is derived from the scene contexts and is subject to various implicit factors beyond its visual appearance. That is, visually similar objects may be totally opposite in their privacy classes. To explicitly derive the objects' privacy class from the scene contexts, in this paper, we interpret the POI task as a visual reasoning task aimed at the privacy of each object in the scene. Following this interpretation, we propose the PrivacyGuard framework for POI. PrivacyGuard contains three stages. i) Structuring: an unstructured image is first converted into a structured, heterogeneous scene graph that embeds rich scene contexts. ii) Data Augmentation: a contextual perturbation oversampling strategy is proposed to create slightly perturbed privacy-sensitive objects in a scene graph, thereby balancing the skewed distribution of privacy classes. iii) Hybrid Graph Generation & Reasoning: the balanced, heterogeneous scene graph is then transformed into a hybrid graph by endowing it with extra "node-node" and "edge-edge" homogeneous paths. These homogeneous paths allow direct message passing between nodes or edges, thereby accelerating reasoning and facilitating the capturing of subtle context changes. Based on this hybrid graph... **For the full abstract, see the original paper.**
comment: I would like to formally request the withdrawal of my manuscript from arXiv. After a further internal review, I realized that the dataset used in this study contains personal or sensitive information that may inadvertently compromise individuals' privacy
♻ ☆ Orthogonal Finetuning Made Scalable
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
comment: EMNLP 2025 Main (18 pages, 7 figures, project page: https://spherelab.ai/oftv2/)
♻ ☆ SHAN: Object-Level Privacy Detection via Inference on Scene Heterogeneous Graph
With the rise of social platforms, protecting privacy has become an important issue. Privacy object detection aims to accurately locate private objects in images. It is the foundation of safeguarding individuals' privacy rights and ensuring responsible data handling practices in the digital age. Since privacy of object is not shift-invariant, the essence of the privacy object detection task is inferring object privacy based on scene information. However, privacy object detection has long been studied as a subproblem of common object detection tasks. Therefore, existing methods suffer from serious deficiencies in accuracy, generalization, and interpretability. Moreover, creating large-scale privacy datasets is difficult due to legal constraints and existing privacy datasets lack label granularity. The granularity of existing privacy detection methods remains limited to the image level. To address the above two issues, we introduce two benchmark datasets for object-level privacy detection and propose SHAN, Scene Heterogeneous graph Attention Network, a model constructs a scene heterogeneous graph from an image and utilizes self-attention mechanisms for scene inference to obtain object privacy. Through experiments, we demonstrated that SHAN performs excellently in privacy object detection tasks, with all metrics surpassing those of the baseline model.
comment: I would like to formally request the withdrawal of my manuscript from arXiv. After a further internal review, I realized that the dataset used in this study contains personal or sensitive information that may inadvertently compromise individuals' privacy
♻ ☆ Unconditional CNN denoisers contain sparse semantic representation of images
Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine the image representation that arises from score estimation in a {fully-convolutional unconditional UNet}. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. Euclidean distances in this representation space are semantically meaningful, even though no conditioning information is provided during training. We develop a novel algorithm for stochastic reconstruction of images conditioned on this representation: The synthesis using the unconditional model is "self-guided" by the representation extracted from that very same model. For a given representation, the common patterns in the set of reconstructed samples reveal the features captured in the middle block of the UNet. Together, these results show, for the first time, that a measure of semantic similarity emerges, unsupervised, solely from the denoising objective.
♻ ☆ Survival Modeling from Whole Slide Images via Patch-Level Graph Clustering and Mixture Density Experts
We introduce a modular framework for predicting cancer-specific survival from whole slide pathology images (WSIs) that significantly improves upon the state-of-the-art accuracy. Our method integrating four key components. Firstly, to tackle large size of WSIs, we use dynamic patch selection via quantile-based thresholding for isolating prognostically informative tissue regions. Secondly, we use graph-guided k-means clustering to capture phenotype-level heterogeneity through spatial and morphological coherence. Thirdly, we use attention mechanisms that model both intra- and inter-cluster relationships to contextualize local features within global spatial relations between various types of tissue compartments. Finally, we use an expert-guided mixture density modeling for estimating complex survival distributions using Gaussian mixture models. The proposed model achieves a concordance index of $0.712 \pm 0.028$ and Brier score of $0.254 \pm 0.018$ on TCGA-KIRC (renal cancer), and a concordance index of $0.645 \pm 0.017$ and Brier score of $0.281 \pm 0.031$ on TCGA-LUAD (lung adenocarcinoma). These results are significantly better than the state-of-art and demonstrate predictive potential of the proposed method across diverse cancer types.
♻ ☆ Joint Denoising of Cryo-EM Projection Images using Polar Transformers
Many imaging modalities involve reconstruction of unknown objects from collections of noisy projections related by random rotations. In one of these modalities, cryogenic electron microscopy (cryo-EM), the extremely low signal-to-noise ratio (SNR) makes integration of information from multiple images crucial. Existing approaches to cryo-EM processing, however, either rely on handcrafted priors or apply deep learning only on select portions of the pipeline, such as particle picking, micrograph denoising, or refinement. A fully end-to-end reconstruction approach requires a neural network architecture that integrates information from multiple images while respecting the rotational symmetry of the measurement process. In this work, we introduce the polar transformer, a new neural network architecture that combines polar representations and transformers along with a convolutional attention mechanism that preserves the rotational symmetry of the problem. We apply it to the particle-level denoising problem, where it is able to learn discriminative features in the images, enabling optimal clustering, alignment, and denoising. On simulated datasets, this achieves up to a $2\times$ reduction in mean squared error (MSE) at a signal-to-noise ratio (SNR) of $0.02$, suggesting new opportunities for data-driven approaches to reconstruction in cryo-EM and related tomographic modalities.
Machine Learning 279
☆ CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations ICCV 2025
We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistence, where the bifiltration functions are jointly learned. Thanks to the differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging and computer vision datasets show the benefit CuMPerLay on classification and segmentation performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.
comment: Appears at ICCV 2025
☆ UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last layer information from VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting its accessibility.We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism that extracts both high level semantics and low level details from text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and faithful transfer of visual information from VLM to the diffusion model which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexibile Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM's reasoning capabilities for increased capabilities and flexibility at inference. In addition, finetuning on editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model when trained on single image editing, zero-shot generalizes to multiple image references further motivating the unified encoder design of UniFusion.
comment: Project page at https://thekevinli.github.io/unifusion/
☆ Wavefront Coding for Accommodation-Invariant Near-Eye Displays
We present a new computational near-eye display method that addresses the vergence-accommodation conflict problem in stereoscopic displays through accommodation-invariance. Our system integrates a refractive lens eyepiece with a novel wavefront coding diffractive optical element, operating in tandem with a pre-processing convolutional neural network. We employ end-to-end learning to jointly optimize the wavefront-coding optics and the image pre-processing module. To implement this approach, we develop a differentiable retinal image formation model that accounts for limiting aperture and chromatic aberrations introduced by the eye optics. We further integrate the neural transfer function and the contrast sensitivity function into the loss model to account for related perceptual effects. To tackle off-axis distortions, we incorporate position dependency into the pre-processing module. In addition to conducting rigorous analysis based on simulations, we also fabricate the designed diffractive optical element and build a benchtop setup, demonstrating accommodation-invariance for depth ranges of up to four diopters.
☆ Dr.LLM: Dynamic Layer Routing in LLMs
Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
comment: 17 pages, Under submission
☆ Sample-Efficient Omniprediction for Proper Losses
We consider the problem of constructing probabilistic predictions that lead to accurate decisions when employed by downstream users to inform actions. For a single decision maker, designing an optimal predictor is equivalent to minimizing a proper loss function corresponding to the negative utility of that individual. For multiple decision makers, our problem can be viewed as a variant of omniprediction in which the goal is to design a single predictor that simultaneously minimizes multiple losses. Existing algorithms for achieving omniprediction broadly fall into two categories: 1) boosting methods that optimize other auxiliary targets such as multicalibration and obtain omniprediction as a corollary, and 2) adversarial two-player game based approaches that estimate and respond to the ``worst-case" loss in an online fashion. We give lower bounds demonstrating that multicalibration is a strictly more difficult problem than omniprediction and thus the former approach must incur suboptimal sample complexity. For the latter approach, we discuss how these ideas can be used to obtain a sample-efficient algorithm through an online-to-batch conversion. This conversion has the downside of returning a complex, randomized predictor. We improve on this method by designing a more direct, unrandomized algorithm that exploits structural elements of the set of proper losses.
☆ AnyUp: Universal Feature Upsampling
We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.
comment: Project Website: https://wimmerth.github.io/anyup/
☆ KoALA: KL-L0 Adversarial Detector via Label Agreement
Deep neural networks are highly susceptible to adversarial attacks, which pose significant risks to security- and safety-critical applications. We present KoALA (KL-L0 Adversarial detection via Label Agreement), a novel, semantics-free adversarial detector that requires no architectural changes or adversarial retraining. KoALA operates on a simple principle: it detects an adversarial attack when class predictions from two complementary similarity metrics disagree. These metrics-KL divergence and an L0-based similarity-are specifically chosen to detect different types of perturbations. The KL divergence metric is sensitive to dense, low-amplitude shifts, while the L0-based similarity is designed for sparse, high-impact changes. We provide a formal proof of correctness for our approach. The only training required is a simple fine-tuning step on a pre-trained image encoder using clean images to ensure the embeddings align well with both metrics. This makes KOALA a lightweight, plug-and-play solution for existing models and various data modalities. Our extensive experiments on ResNet/CIFAR-10 and CLIP/Tiny-ImageNet confirm our theoretical claims. When the theorem's conditions are met, KoALA consistently and effectively detects adversarial examples. On the full test sets, KoALA achieves a precision of 0.94 and a recall of 0.81 on ResNet/CIFAR-10, and a precision of 0.66 and a recall of 0.85 on CLIP/Tiny-ImageNet.
☆ VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
☆ Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps
We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE's convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $\epsilon$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training.
comment: Do Tien Hai, Trung Nguyen Mai, and TrungTin Nguyen are co-first authors
☆ Doctor Rashomon and the UNIVERSE of Madness: Variable Importance with Unobserved Confounding and the Rashomon Effect
Variable importance (VI) methods are often used for hypothesis generation, feature selection, and scientific validation. In the standard VI pipeline, an analyst estimates VI for a single predictive model with only the observed features. However, the importance of a feature depends heavily on which other variables are included in the model, and essential variables are often omitted from observational datasets. Moreover, the VI estimated for one model is often not the same as the VI estimated for another equally-good model - a phenomenon known as the Rashomon Effect. We address these gaps by introducing UNobservables and Inference for Variable importancE using Rashomon SEts (UNIVERSE). Our approach adapts Rashomon sets - the sets of near-optimal models in a dataset - to produce bounds on the true VI even with missing features. We theoretically guarantee the robustness of our approach, show strong performance on semi-synthetic simulations, and demonstrate its utility in a credit risk task.
☆ HYPE: Hybrid Planning with Ego Proposal-Conditioned Predictions
Safe and interpretable motion planning in complex urban environments needs to reason about bidirectional multi-agent interactions. This reasoning requires to estimate the costs of potential ego driving maneuvers. Many existing planners generate initial trajectories with sampling-based methods and refine them by optimizing on learned predictions of future environment states, which requires a cost function that encodes the desired vehicle behavior. Designing such a cost function can be very challenging, especially if a wide range of complex urban scenarios has to be considered. We propose HYPE: HYbrid Planning with Ego proposal-conditioned predictions, a planner that integrates multimodal trajectory proposals from a learned proposal model as heuristic priors into a Monte Carlo Tree Search (MCTS) refinement. To model bidirectional interactions, we introduce an ego-conditioned occupancy prediction model, enabling consistent, scene-aware reasoning. Our design significantly simplifies cost function design in refinement by considering proposal-driven guidance, requiring only minimalistic grid-based cost terms. Evaluations on large-scale real-world benchmarks nuPlan and DeepUrban show that HYPE effectively achieves state-of-the-art performance, especially in safety and adaptability.
☆ Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior
A long-standing challenge in machine learning has been the rigid separation between data work and model refinement, enforced by slow fine-tuning cycles. The rise of Large Language Models (LLMs) overcomes this historical barrier, allowing applications developers to instantly govern model behavior by editing prompt instructions. This shift enables a new paradigm: data-model co-evolution, where a living test set and a model's instructions evolve in tandem. We operationalize this paradigm in an interactive system designed to address the critical challenge of encoding subtle, domain-specific policies into prompt instructions. The system's structured workflow guides people to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate instruction revisions against a growing test set. A user study shows our workflow helps participants refine instructions systematically and specify ambiguous policies more concretely. This work points toward more robust and responsible LLM applications through human-in-the-loop development aligned with local preferences and policies.
☆ Hierarchical Federated Learning for Crop Yield Prediction in Smart Agricultural Production Systems
In this paper, we presents a novel hierarchical federated learning architecture specifically designed for smart agricultural production systems and crop yield prediction. Our approach introduces a seasonal subscription mechanism where farms join crop-specific clusters at the beginning of each agricultural season. The proposed three-layer architecture consists of individual smart farms at the client level, crop-specific aggregators at the middle layer, and a global model aggregator at the top level. Within each crop cluster, clients collaboratively train specialized models tailored to specific crop types, which are then aggregated to produce a higher-level global model that integrates knowledge across multiple crops. This hierarchical design enables both local specialization for individual crop types and global generalization across diverse agricultural contexts while preserving data privacy and reducing communication overhead. Experiments demonstrate the effectiveness of the proposed system, showing that local and crop-layer models closely follow actual yield patterns with consistent alignment, significantly outperforming standard machine learning models. The results validate the advantages of hierarchical federated learning in the agricultural context, particularly for scenarios involving heterogeneous farming environments and privacy-sensitive agricultural data.
comment: 6 pages, 3 figures, conference
☆ Improving Decision Trees through the Lens of Parameterized Local Search NeurIPS 2025
Algorithms for learning decision trees often include heuristic local-search operations such as (1) adjusting the threshold of a cut or (2) also exchanging the feature of that cut. We study minimizing the number of classification errors by performing a fixed number of a single type of these operations. Although we discover that the corresponding problems are NP-complete in general, we provide a comprehensive parameterized-complexity analysis with the aim of determining those properties of the problems that explain the hardness and those that make the problems tractable. For instance, we show that the problems remain hard for a small number $d$ of features or small domain size $D$ but the combination of both yields fixed-parameter tractability. That is, the problems are solvable in $(D + 1)^{2d} \cdot |I|^{O(1)}$ time, where $|I|$ is the size of the input. We also provide a proof-of-concept implementation of this algorithm and report on empirical results.
comment: Accepted at NeurIPS 2025
☆ CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a post-training novel Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.
comment: Accepted at EMNLP Findings 2025
☆ Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction
Chemical pretrained models, sometimes referred to as foundation models, are receiving considerable interest for drug discovery applications. The general chemical knowledge extracted from self-supervised training has the potential to improve predictions for critical drug discovery endpoints, including on-target potency and ADMET properties. Multi-task learning has previously been successfully leveraged to improve predictive models. Here, we show that enabling multitasking in finetuning of chemical pretrained graph neural network models such as Kinetic GROVER Multi-Task (KERMT), an enhanced version of the GROVER model, and Knowledge-guided Pre-training of Graph Transformer (KGPT) significantly improves performance over non-pretrained graph neural network models. Surprisingly, we find that the performance improvement from finetuning KERMT in a multitask manner is most significant at larger data sizes. Additionally, we publish two multitask ADMET data splits to enable more accurate benchmarking of multitask deep learning methods for drug property prediction. Finally, we provide an accelerated implementation of the KERMT model on GitHub, unlocking large-scale pretraining, finetuning, and inference in industrial drug discovery workflows.
☆ Topological Signatures of ReLU Neural Network Activation Patterns
This paper explores the topological signatures of ReLU neural network activation patterns. We consider feedforward neural networks with ReLU activation functions and analyze the polytope decomposition of the feature space induced by the network. Mainly, we investigate how the Fiedler partition of the dual graph and show that it appears to correlate with the decision boundary -- in the case of binary classification. Additionally, we compute the homology of the cellular decomposition -- in a regression task -- to draw similar patterns in behavior between the training loss and polyhedral cell-count, as the model is trained.
☆ Who is a Better Matchmaker? Human vs. Algorithmic Judge Assignment in a High-Stakes Startup Competition
There is growing interest in applying artificial intelligence (AI) to automate and support complex decision-making tasks. However, it remains unclear how algorithms compare to human judgment in contexts requiring semantic understanding and domain expertise. We examine this in the context of the judge assignment problem, matching submissions to suitably qualified judges. Specifically, we tackled this problem at the Harvard President's Innovation Challenge, the university's premier venture competition awarding over \$500,000 to student and alumni startups. This represents a real-world environment where high-quality judge assignment is essential. We developed an AI-based judge-assignment algorithm, Hybrid Lexical-Semantic Similarity Ensemble (HLSE), and deployed it at the competition. We then evaluated its performance against human expert assignments using blinded match-quality scores from judges on $309$ judge-venture pairs. Using a Mann-Whitney U statistic based test, we found no statistically significant difference in assignment quality between the two approaches ($AUC=0.48, p=0.40$); on average, algorithmic matches are rated $3.90$ and manual matches $3.94$ on a 5-point scale, where 5 indicates an excellent match. Furthermore, manual assignments that previously required a full week could be automated in several hours by the algorithm during deployment. These results demonstrate that HLSE achieves human-expert-level matching quality while offering greater scalability and efficiency, underscoring the potential of AI-driven solutions to support and enhance human decision-making for judge assignment in high-stakes settings.
comment: 17 Pages, 2 figures
☆ DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization
Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when only corrupted or noisy observations are available remains challenging. In this work, we propose a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data. Our proposed method, DiffEM, utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.
☆ EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels
Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at https://github.com/KPeng9510/ERELIFM.
comment: The source code is available at https://github.com/KPeng9510/ERELIFM
☆ Few Shot Semi-Supervised Learning for Abnormal Stop Detection from Sparse GPS Trajectories
Abnormal stop detection (ASD) in intercity coach transportation is critical for ensuring passenger safety, operational reliability, and regulatory compliance. However, two key challenges hinder ASD effectiveness: sparse GPS trajectories, which obscure short or unauthorized stops, and limited labeled data, which restricts supervised learning. Existing methods often assume dense sampling or regular movement patterns, limiting their applicability. To address data sparsity, we propose a Sparsity-Aware Segmentation (SAS) method that adaptively defines segment boundaries based on local spatial-temporal density. Building upon these segments, we introduce three domain-specific indicators to capture abnormal stop behaviors. To further mitigate the impact of sparsity, we develop Locally Temporal-Indicator Guided Adjustment (LTIGA), which smooths these indicators via local similarity graphs. To overcome label scarcity, we construct a spatial-temporal graph where each segment is a node with LTIGA-refined features. We apply label propagation to expand weak supervision across the graph, followed by a GCN to learn relational patterns. A final self-training module incorporates high-confidence pseudo-labels to iteratively improve predictions. Experiments on real-world coach data show an AUC of 0.854 and AP of 0.866 using only 10 labeled instances, outperforming prior methods. The code and dataset are publicly available at \href{https://github.com/pangjunbiao/Abnormal-Stop-Detection-SSL.git}
☆ CoRA: Covariate-Aware Adaptation of Time Series Foundation Models
Time Series Foundation Models (TSFMs) have shown significant impact through their model capacity, scalability, and zero-shot generalization. However, due to the heterogeneity of inter-variate dependencies and the backbone scalability on large-scale multivariate datasets, most TSFMs are typically pre-trained on univariate time series. This limitation renders them oblivious to crucial information from diverse covariates in real-world forecasting tasks. To further enhance the performance of TSFMs, we propose a general covariate-aware adaptation (CoRA) framework for TSFMs. It leverages pre-trained backbones of foundation models while effectively incorporating exogenous covariates from various modalities, including time series, language, and images, to improve the quality of predictions. Technically, CoRA maintains the equivalence of initialization and parameter consistency during adaptation. With preserved backbones of foundation models as frozen feature extractors, the outcome embeddings from foundation models are empirically demonstrated more informative than raw data. Further, CoRA employs a novel Granger Causality Embedding (GCE) to automatically evaluate covariates regarding their causal predictability with respect to the target variate. We incorporate these weighted embeddings with a zero-initialized condition-injection mechanism, avoiding catastrophic forgetting of pre-trained foundation models and gradually integrates exogenous information. Extensive experiments show that CoRA of TSFMs surpasses state-of-the-art covariate-aware deep forecasters with full or few-shot training samples, achieving 31.1% MSE reduction on covariate-aware forecasting. Compared to other adaptation methods, CoRA exhibits strong compatibility with various advanced TSFMs and extends the scope of covariates to other modalities, presenting a practical paradigm for the application of TSFMs.
☆ Demystifying Hybrid Thinking: Can LLMs Truly Switch Between Think and No-Think?
Hybrid thinking enables LLMs to switch between reasoning and direct answering, offering a balance between efficiency and reasoning capability. Yet our experiments reveal that current hybrid thinking LLMs only achieve partial mode separation: reasoning behaviors often leak into the no-think mode. To understand and mitigate this, we analyze the factors influencing controllability and identify four that matter most: (1) larger data scale, (2) using think and no-think answers from different questions rather than the same question, (3) a moderate increase in no-think data number, and (4) a two-phase strategy that first trains reasoning ability and then applies hybrid think training. Building on these findings, we propose a practical recipe that, compared to standard training, can maintain accuracy in both modes while significantly reducing no-think output length (from $1085$ to $585$ on MATH500) and occurrences of reasoning-supportive tokens such as ``\texttt{wait}'' (from $5917$ to $522$ on MATH500). Our findings highlight the limitations of current hybrid thinking and offer directions for strengthening its controllability.
comment: 10 pages, 6 figures
☆ Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Concept Manipulation \textbf{CALM}, an inference-time method that suppresses harmful concepts by modifying latent representations of the last layer of the model, without retraining. Leveraging \gls*{cw} technique from Computer Vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods in most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
☆ Structure-Aware Spectral Sparsification via Uniform Edge Sampling NeurIPS 2025
Spectral clustering is a fundamental method for graph partitioning, but its reliance on eigenvector computation limits scalability to massive graphs. Classical sparsification methods preserve spectral properties by sampling edges proportionally to their effective resistances, but require expensive preprocessing to estimate these resistances. We study whether uniform edge sampling-a simple, structure-agnostic strategy-can suffice for spectral clustering. Our main result shows that for graphs admitting a well-separated $k$-clustering, characterized by a large structure ratio $\Upsilon(k) = \lambda_{k+1} / \rho_G(k)$, uniform sampling preserves the spectral subspace used for clustering. Specifically, we prove that uniformly sampling $O(\gamma^2 n \log n / \epsilon^2)$ edges, where $\gamma$ is the Laplacian condition number, yields a sparsifier whose top $(n-k)$-dimensional eigenspace is approximately orthogonal to the cluster indicators. This ensures that the spectral embedding remains faithful, and clustering quality is preserved. Our analysis introduces new resistance bounds for intra-cluster edges, a rank-$(n-k)$ effective resistance formulation, and a matrix Chernoff bound adapted to the dominant eigenspace. These tools allow us to bypass importance sampling entirely. Conceptually, our result connects recent coreset-based clustering theory to spectral sparsification, showing that under strong clusterability, even uniform sampling is structure-aware. This provides the first provable guarantee that uniform edge sampling suffices for structure-preserving spectral clustering.
comment: 19 pages, 4 figures, NeurIPS 2025
☆ Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models
Whisper models have achieved remarkable progress in speech recognition; yet their large size remains a bottleneck for deployment on resource-constrained edge devices. This paper proposes a framework to design fine-tuned variants of Whisper which address the above problem. Structured sparsity is enforced via the Sparse Group LASSO penalty as a loss regularizer, to reduce the number of FLOating Point operations (FLOPs). Further, a weight statistics aware pruning algorithm is proposed. We also design our custom text normalizer for WER evaluation. On Common Voice 11.0 Hindi dataset, we obtain, without degrading WER, (a) 35.4% reduction in model parameters, 14.25% lower memory consumption and 18.5% fewer FLOPs on Whisper-small, and (b) 31% reduction in model parameters, 15.29% lower memory consumption and 16.95% fewer FLOPs on Whisper-medium; and, (c) substantially outperform the state-of-the-art Iterative Magnitude Pruning based method by pruning 18.7% more parameters along with a 12.31 reduction in WER.
☆ SG-XDEAT: Sparsity-Guided Cross-Dimensional and Cross-Encoding Attention with Target-Aware Conditioning in Tabular Learning
We propose SG-XDEAT (Sparsity-Guided Cross Dimensional and Cross-Encoding Attention with Target Aware Conditioning), a novel framework designed for supervised learning on tabular data. At its core, SG-XDEAT employs a dual-stream encoder that decomposes each input feature into two parallel representations: a raw value stream and a target-conditioned (label-aware) stream. These dual representations are then propagated through a hierarchical stack of attention-based modules. SG-XDEAT integrates three key components: (i) Cross-Dimensional self-attention, which captures intra-view dependencies among features within each stream; (ii) Cross-Encoding self-attention, which enables bidirectional interaction between raw and target-aware representations; and (iii) an Adaptive Sparse Self-Attention (ASSA) mechanism, which dynamically suppresses low-utility tokens by driving their attention weights toward zero--thereby mitigating the impact of noise. Empirical results on multiple public benchmarks show consistent gains over strong baselines, confirming that jointly modeling raw and target-aware views--while adaptively filtering noise--yields a more robust deep tabular learner.
☆ Towards Foundation Inference Models that Learn ODEs In-Context
Ordinary differential equations (ODEs) describe dynamical systems evolving deterministically in continuous time. Accurate data-driven modeling of systems as ODEs, a central problem across the natural sciences, remains challenging, especially if the data is sparse or noisy. We introduce FIM-ODE (Foundation Inference Model for ODEs), a pretrained neural model designed to estimate ODEs zero-shot (i.e., in context) from sparse and noisy observations. Trained on synthetic data, the model utilizes a flexible neural operator for robust ODE inference, even from corrupted data. We empirically verify that FIM-ODE provides accurate estimates, on par with a neural state-of-the-art method, and qualitatively compare the structure of their estimated vector fields.
☆ On Foundation Models for Temporal Point Processes to Accelerate Scientific Discovery
Many scientific fields, from medicine to seismology, rely on analyzing sequences of events over time to understand complex systems. Traditionally, machine learning models must be built and trained from scratch for each new dataset, which is a slow and costly process. We introduce a new approach: a single, powerful model that learns the underlying patterns of event data in context. We trained this "foundation model" on millions of simulated event sequences, teaching it a general-purpose understanding of how events can unfold. As a result, our model can analyze new scientific data instantly, without retraining, simply by looking at a few examples from the dataset. It can also be quickly fine-tuned for even higher accuracy. This approach makes sophisticated event analysis more accessible and accelerates the pace of scientific discovery.
☆ Contraction and entropy production in continuous-time Sinkhorn dynamics
Recently, the vanishing-step-size limit of the Sinkhorn algorithm at finite regularization parameter $\varepsilon$ was shown to be a mirror descent in the space of probability measures. We give $L^2$ contraction criteria in two time-dependent metrics induced by the mirror Hessian, which reduce to the coercivity of certain conditional expectation operators. We then give an exact identity for the entropy production rate of the Sinkhorn flow, which was previously known only to be nonpositive. Examining this rate shows that the standard semigroup analysis of diffusion processes extends systematically to the Sinkhorn flow. We show that the flow induces a reversible Markov dynamics on the target marginal as an Onsager gradient flow. We define the Dirichlet form associated to its (nonlocal) infinitesimal generator, prove a Poincar\'e inequality for it, and show that the spectral gap is strictly positive along the Sinkhorn flow whenever $\varepsilon > 0$. Lastly, we show that the entropy decay is exponential if and only if a logarithmic Sobolev inequality (LSI) holds. We give for illustration two immediate practical use-cases for the Sinkhorn LSI: as a design principle for the latent space in which generative models are trained, and as a stopping heuristic for discrete-time algorithms.
comment: 10 pages excluding references
☆ Expert or not? assessing data quality in offline reinforcement learning
Offline reinforcement learning (RL) learns exclusively from static datasets, without further interaction with the environment. In practice, such datasets vary widely in quality, often mixing expert, suboptimal, and even random trajectories. The choice of algorithm therefore depends on dataset fidelity. Behavior cloning can suffice on high-quality data, whereas mixed- or low-quality data typically benefits from offline RL methods that stitch useful behavior across trajectories. Yet in the wild it is difficult to assess dataset quality a priori because the data's provenance and skill composition are unknown. We address the problem of estimating offline dataset quality without training an agent. We study a spectrum of proxies from simple cumulative rewards to learned value based estimators, and introduce the Bellman Wasserstein distance (BWD), a value aware optimal transport score that measures how dissimilar a dataset's behavioral policy is from a random reference policy. BWD is computed from a behavioral critic and a state conditional OT formulation, requiring no environment interaction or full policy optimization. Across D4RL MuJoCo tasks, BWD strongly correlates with an oracle performance score that aggregates multiple offline RL algorithms, enabling efficient prediction of how well standard agents will perform on a given dataset. Beyond prediction, integrating BWD as a regularizer during policy optimization explicitly pushes the learned policy away from random behavior and improves returns. These results indicate that value aware, distributional signals such as BWD are practical tools for triaging offline RL datasets and policy optimization.
☆ Adapting Noise to Data: Generative Flows from 1D Processes
We introduce a general framework for constructing generative models using one-dimensional noising processes. Beyond diffusion processes, we outline examples that demonstrate the flexibility of our approach. Motivated by this, we propose a novel framework in which the 1D processes themselves are learnable, achieved by parameterizing the noise distribution through quantile functions that adapt to the data. Our construction integrates seamlessly with standard objectives, including Flow Matching and consistency models. Learning quantile-based noise naturally captures heavy tails and compact supports when present. Numerical experiments highlight both the flexibility and the effectiveness of our method.
☆ Laminar: A Scalable Asynchronous RL Post-Training Framework
Reinforcement learning (RL) post-training for Large Language Models (LLMs) is now scaling to large clusters and running for extended durations to enhance model reasoning performance. However, the scalability of existing RL frameworks is limited, as extreme long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current asynchronous RL systems attempt to mitigate this, but they rely on global weight synchronization between the actor and all rollouts, which creates a rigid model update schedule. This global synchronization is ill-suited for the highly skewed and evolving distribution of trajectory generation latency in RL training, crippling training efficiency. Our key insight is that efficient scaling requires breaking this lockstep through trajectory-level asynchrony, which generates and consumes each trajectory independently. We propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture. First, we replace global updates with a tier of relay workers acting as a distributed parameter service. This enables asynchronous and fine-grained weight synchronization, allowing rollouts to pull the latest weight anytime without stalling the actor's training loop. Second, a dynamic repack mechanism consolidates long-tail trajectories onto a few dedicated rollouts, maximizing generation throughput. The fully decoupled design also isolates failures, ensuring robustness for long-running jobs. Our evaluation on a 1024-GPU cluster shows that Laminar achieves up to 5.48$\times$ training throughput speedup over state-of-the-art systems, while reducing model convergence time.
☆ Learning-To-Measure: In-context Active Feature Acquisition
Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.
☆ Towards Fast Coarse-graining and Equation Discovery with Foundation Inference Models
High-dimensional recordings of dynamical processes are often characterized by a much smaller set of effective variables, evolving on low-dimensional manifolds. Identifying these latent dynamics requires solving two intertwined problems: discovering appropriate coarse-grained variables and simultaneously fitting the governing equations. Most machine learning approaches tackle these tasks jointly by training autoencoders together with models that enforce dynamical consistency. We propose to decouple the two problems by leveraging the recently introduced Foundation Inference Models (FIMs). FIMs are pretrained models that estimate the infinitesimal generators of dynamical systems (e.g., the drift and diffusion of a stochastic differential equation) in zero-shot mode. By amortizing the inference of the dynamics through a FIM with frozen weights, and training only the encoder-decoder map, we define a simple, simulation-consistent loss that stabilizes representation learning. A proof of concept on a stochastic double-well system with semicircle diffusion, embedded into synthetic video data, illustrates the potential of this approach for fast and reusable coarse-graining pipelines.
☆ Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate DNA Language Models (DNA LMs) capabilities. However, evaluating DNA LMs is a complex task that intersects genomic's domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- number of data loading workers and buffer sizes -- create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
☆ Rethinking Knowledge Distillation: A Data Dependent Regulariser With a Negative Asymmetric Payoff
Knowledge distillation is often considered a compression mechanism when judged on the resulting student's accuracy and loss, yet its functional impact is poorly understood. In this work, we quantify the compression capacity of knowledge distillation and the resulting knowledge transfer from a functional perspective, decoupling compression from architectural reduction, which provides an improved understanding of knowledge distillation. We employ hypothesis testing, controls, and random control distillation to understand knowledge transfer mechanisms across data modalities. To rigorously test the breadth and limits of our analyses, we explore multiple distillation variants and analyse distillation scaling laws across model sizes. Our findings demonstrate that, while there is statistically significant knowledge transfer in some modalities and architectures, the extent of this transfer is less pronounced than anticipated, even under conditions designed to maximise knowledge sharing. Notably, in cases of significant knowledge transfer, we identify a consistent and severe asymmetric transfer of negative knowledge to the student, raising safety concerns in knowledge distillation applications. Across 12 experimental setups, 9 architectures, and 7 datasets, our findings show that knowledge distillation functions less as a compression mechanism and more as a data-dependent regulariser with a negative asymmetric payoff.
comment: 45 pages, 24 figures and 104 tables
☆ Research in Collaborative Learning Does Not Serve Cross-Silo Federated Learning in Practice
Cross-silo federated learning (FL) is a promising approach to enable cross-organization collaboration in machine learning model development without directly sharing private data. Despite growing organizational interest driven by data protection regulations such as GDPR and HIPAA, the adoption of cross-silo FL remains limited in practice. In this paper, we conduct an interview study to understand the practical challenges associated with cross-silo FL adoption. With interviews spanning a diverse set of stakeholders such as user organizations, software providers, and academic researchers, we uncover various barriers, from concerns about model performance to questions of incentives and trust between participating organizations. Our study shows that cross-silo FL faces a set of challenges that have yet to be well-captured by existing research in the area and are quite distinct from other forms of federated learning such as cross-device FL. We end with a discussion on future research directions that can help overcome these challenges.
comment: Main text: 23 pages, 2 tables, 2 figures
☆ LayerSync: Self-aligning Intermediate Layers
We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularizer term with no overhead on diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires no pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of flow-based transformer by over 8.75x on ImageNet dataset and improved the generation quality by 23.6%. The code is available at https://github.com/vita-epfl/LayerSync.
☆ CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving
End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training. CoIRL-AD introduces a competition-based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.
comment: 18 pages, 17 figures
☆ Universal Adaptive Environment Discovery
An open problem in Machine Learning is how to avoid models to exploit spurious correlations in the data; a famous example is the background-label shortcut in the Waterbirds dataset. A common remedy is to train a model across multiple environments; in the Waterbirds dataset, this corresponds to training by randomizing the background. However, selecting the right environments is a challenging problem, given that these are rarely known a priori. We propose Universal Adaptive Environment Discovery (UAED), a unified framework that learns a distribution over data transformations that instantiate environments, and optimizes any robust objective averaged over this learned distribution. UAED yields adaptive variants of IRM, REx, GroupDRO, and CORAL without predefined groups or manual environment design. We provide a theoretical analysis by providing PAC-Bayes bounds and by showing robustness to test environment distributions under standard conditions. Empirically, UAED discovers interpretable environment distributions and improves worst-case accuracy on standard benchmarks, while remaining competitive on mean accuracy. Our results indicate that making environments adaptive is a practical route to out-of-distribution generalization.
comment: 8 papes in the main body, 4 pages in the appendix, 4 figures and 9 tables overall, conference
☆ Evaluation of Real-Time Preprocessing Methods in AI-Based ECG Signal Analysis
The increasing popularity of portable ECG systems and the growing demand for privacy-compliant, energy-efficient real-time analysis require new approaches to signal processing at the point of data acquisition. In this context, the edge domain is acquiring increasing importance, as it not only reduces latency times, but also enables an increased level of data security. The FACE project aims to develop an innovative machine learning solution for analysing long-term electrocardiograms that synergistically combines the strengths of edge and cloud computing. In this thesis, various pre-processing steps of ECG signals are analysed with regard to their applicability in the project. The selection of suitable methods in the edge area is based in particular on criteria such as energy efficiency, processing capability and real-time capability.
comment: Conference paper for 2025 IEEE World AI IoT Congress (AIIoT), FACE Project, University of Siegen, Germany
☆ Multi-Armed Bandits with Minimum Aggregated Revenue Constraints
We examine a multi-armed bandit problem with contextual information, where the objective is to ensure that each arm receives a minimum aggregated reward across contexts while simultaneously maximizing the total cumulative reward. This framework captures a broad class of real-world applications where fair revenue allocation is critical and contextual variation is inherent. The cross-context aggregation of minimum reward constraints, while enabling better performance and easier feasibility, introduces significant technical challenges -- particularly the absence of closed-form optimal allocations typically available in standard MAB settings. We design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction. For each algorithm, we derive problem-dependent upper bounds on both regret and constraint violations. Furthermore, we establish a lower bound demonstrating that the dependence on the time horizon in our results is optimal in general and revealing fundamental limitations of the free exploration principle leveraged in prior work.
☆ Why the noise model matters: A performance gap in learned regularization
This article addresses the challenge of learning effective regularizers for linear inverse problems. We analyze and compare several types of learned variational regularization against the theoretical benchmark of the optimal affine reconstruction, i.e. the best possible affine linear map for minimizing the mean squared error. It is known that this optimal reconstruction can be achieved using Tikhonov regularization, but this requires precise knowledge of the noise covariance to properly weight the data fidelity term. However, in many practical applications, noise statistics are unknown. We therefore investigate the performance of regularization methods learned without access to this noise information, focusing on Tikhonov, Lavrentiev, and quadratic regularization. Our theoretical analysis and numerical experiments demonstrate that for non-white noise, a performance gap emerges between these methods and the optimal affine reconstruction. Furthermore, we show that these different types of regularization yield distinct results, highlighting that the choice of regularizer structure is critical when the noise model is not explicitly learned. Our findings underscore the significant value of accurately modeling or co-learning noise statistics in data-driven regularization.
☆ The Robustness of Differentiable Causal Discovery in Misspecified Scenarios ICLR 2025
Causal discovery aims to learn causal relationships between variables from targeted data, making it a fundamental task in machine learning. However, causal discovery algorithms often rely on unverifiable causal assumptions, which are usually difficult to satisfy in real-world data, thereby limiting the broad application of causal discovery in practical scenarios. Inspired by these considerations, this work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit robustness under the metrics of Structural Hamming Distance and Structural Intervention Distance of the inferred graphs in commonly used challenging scenarios, except for scale variation. We also provide the theoretical explanations for the performance of differentiable causal discovery methods. Finally, our work aims to comprehensively benchmark the performance of recent differentiable causal discovery methods under model assumption violations, and provide the standard for reasonable evaluation of causal discovery, as well as to further promote its application in real-world scenarios.
comment: accepted to ICLR 2025
☆ Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance
Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.
☆ PubSub-VFL: Towards Efficient Two-Party Split Learning in Heterogeneous Environments via Publisher/Subscriber Architecture NeurIPS 2025
With the rapid advancement of the digital economy, data collaboration between organizations has become a well-established business model, driving the growth of various industries. However, privacy concerns make direct data sharing impractical. To address this, Two-Party Split Learning (a.k.a. Vertical Federated Learning (VFL)) has emerged as a promising solution for secure collaborative learning. Despite its advantages, this architecture still suffers from low computational resource utilization and training efficiency. Specifically, its synchronous dependency design increases training latency, while resource and data heterogeneity among participants further hinder efficient computation. To overcome these challenges, we propose PubSub-VFL, a novel VFL paradigm with a Publisher/Subscriber architecture optimized for two-party collaborative learning with high computational efficiency. PubSub-VFL leverages the decoupling capabilities of the Pub/Sub architecture and the data parallelism of the parameter server architecture to design a hierarchical asynchronous mechanism, reducing training latency and improving system efficiency. Additionally, to mitigate the training imbalance caused by resource and data heterogeneity, we formalize an optimization problem based on participants' system profiles, enabling the selection of optimal hyperparameters while preserving privacy. We conduct a theoretical analysis to demonstrate that PubSub-VFL achieves stable convergence and is compatible with security protocols such as differential privacy. Extensive case studies on five benchmark datasets further validate its effectiveness, showing that, compared to state-of-the-art baselines, PubSub-VFL not only accelerates training by $2 \sim 7\times$ without compromising accuracy, but also achieves a computational resource utilization rate of up to 91.07%.
comment: Accepted at NeurIPS 2025
☆ CrossAD: Time Series Anomaly Detection with Cross-scale Associations and Cross-window Modeling
Time series anomaly detection plays a crucial role in a wide range of real-world applications. Given that time series data can exhibit different patterns at different sampling granularities, multi-scale modeling has proven beneficial for uncovering latent anomaly patterns that may not be apparent at a single scale. However, existing methods often model multi-scale information independently or rely on simple feature fusion strategies, neglecting the dynamic changes in cross-scale associations that occur during anomalies. Moreover, most approaches perform multi-scale modeling based on fixed sliding windows, which limits their ability to capture comprehensive contextual information. In this work, we propose CrossAD, a novel framework for time series Anomaly Detection that takes Cross-scale associations and Cross-window modeling into account. We propose a cross-scale reconstruction that reconstructs fine-grained series from coarser series, explicitly capturing cross-scale associations. Furthermore, we design a query library and incorporate global multi-scale context to overcome the limitations imposed by fixed window sizes. Extensive experiments conducted on multiple real-world datasets using nine evaluation metrics validate the effectiveness of CrossAD, demonstrating state-of-the-art performance in anomaly detection.
comment: Accepted by the thirty-ninth annual conference on Neural Information Processing Systems
☆ Diff-XYZ: A Benchmark for Evaluating Diff Understanding
Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to do a focused empirical study of the unified diff format and run a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format is good for larger models in the diff generation scenario, yet not suited well for diff analysis and smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and models editing code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
☆ SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning(SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
comment: Accepted by EMNLP2025
☆ Time-Correlated Video Bridge Matching
Diffusion models excel in noise-to-data generation tasks, providing a mapping from a Gaussian distribution to a more complex data distribution. However they struggle to model translations between complex distributions, limiting their effectiveness in data-to-data tasks. While Bridge Matching (BM) models address this by finding the translation between data distributions, their application to time-correlated data sequences remains unexplored. This is a critical limitation for video generation and manipulation tasks, where maintaining temporal coherence is particularly important. To address this gap, we propose Time-Correlated Video Bridge Matching (TCVBM), a framework that extends BM to time-correlated data sequences in the video domain. TCVBM explicitly models inter-sequence dependencies within the diffusion bridge, directly incorporating temporal correlations into the sampling process. We compare our approach to classical methods based on bridge matching and diffusion models for three video-related tasks: frame interpolation, image-to-video generation, and video super-resolution. TCVBM achieves superior performance across multiple quantitative metrics, demonstrating enhanced generation quality and reconstruction fidelity.
☆ A Function Centric Perspective On Flat and Sharp Minima
Flat minima are widely believed to correlate with improved generalisation in deep neural networks. However, this connection has proven more nuanced in recent studies, with both theoretical counterexamples and empirical exceptions emerging in the literature. In this paper, we revisit the role of sharpness in model performance, proposing that sharpness is better understood as a function-dependent property rather than a reliable indicator of poor generalisation. We conduct extensive empirical studies, from single-objective optimisation to modern image classification tasks, showing that sharper minima often emerge when models are regularised (e.g., via SAM, weight decay, or data augmentation), and that these sharp minima can coincide with better generalisation, calibration, robustness, and functional consistency. Across a range of models and datasets, we find that baselines without regularisation tend to converge to flatter minima yet often perform worse across all safety metrics. Our findings demonstrate that function complexity, rather than flatness alone, governs the geometry of solutions, and that sharper minima can reflect more appropriate inductive biases (especially under regularisation), calling for a function-centric reappraisal of loss landscape geometry.
comment: 26 pages, 26 tables, 63 figures, pre-print
☆ Bayesian Optimization for Dynamic Pricing and Learning
Dynamic pricing is the practice of adjusting the selling price of a product to maximize a firm's revenue by responding to market demand. The literature typically distinguishes between two settings: infinite inventory, where the firm has unlimited stock and time to sell, and finite inventory, where both inventory and selling horizon are limited. In both cases, the central challenge lies in the fact that the demand function -- how sales respond to price -- is unknown and must be learned from data. Traditional approaches often assume a specific parametric form for the demand function, enabling the use of reinforcement learning (RL) to identify near-optimal pricing strategies. However, such assumptions may not hold in real-world scenarios, limiting the applicability of these methods. In this work, we propose a Gaussian Process (GP) based nonparametric approach to dynamic pricing that avoids restrictive modeling assumptions. We treat the demand function as a black-box function of the price and develop pricing algorithms based on Bayesian Optimization (BO) -- a sample-efficient method for optimizing unknown functions. We present BO-based algorithms tailored for both infinite and finite inventory settings and provide regret guarantees for both regimes, thereby quantifying the learning efficiency of our methods. Through extensive experiments, we demonstrate that our BO-based methods outperform several state-of-the-art RL algorithms in terms of revenue, while requiring fewer assumptions and offering greater robustness. This highlights Bayesian Optimization as a powerful and practical tool for dynamic pricing in complex, uncertain environments.
☆ Formal Models and Convergence Analysis for Context-Aware Security Verification
We present a formal framework for context-aware security verification that establishes provable guarantees for ML-enhanced adaptive systems. We introduce context-completeness - a new security property - and prove: (1) sample complexity bounds showing when adaptive verification succeeds, (2) information-theoretic limits relating context richness to detection capability, (3) convergence guarantees for ML-based payload generators, and (4) compositional soundness bounds. We further provide a formal separation between static context-blind verifiers and context-aware adaptive verifiers: for a natural family of targets, any static verifier with finite payload budget achieves completeness at most alpha, while a context-aware verifier with sufficient information achieves completeness greater than alpha. We validate our theoretical predictions through controlled experiments on 97,224 exploit samples, demonstrating: detection accuracy improving from 58% to 69.93% with dataset growth, success probability increasing from 51% to 82% with context enrichment, training loss converging at O(1/sqrt(T)) rate, and false positive rate (10.19%) within theoretical bounds (12%). Our results show that theoretically-grounded adaptive verification achieves provable improvements over static approaches under stated assumptions while maintaining soundness guarantees.
comment: 11 pages, 4 figures, 4 tables. Presents formal framework for context-aware security verification with ML-enhanced adaptive systems. Includes theoretical bounds (sample complexity, information-theoretic limits, convergence guarantees, soundness preservation) and empirical validation on 97,224 exploit samples
☆ Neural Guided Sampling for Quantum Circuit Optimization
Translating a general quantum circuit on a specific hardware topology with a reduced set of available gates, also known as transpilation, comes with a substantial increase in the length of the equivalent circuit. Due to decoherence, the quality of the computational outcome can degrade seriously with increasing circuit length. Thus, there is major interest to reduce a quantum circuit to an equivalent circuit which is in its gate count as short as possible. One method to address efficient transpilation is based on approaches known from stochastic optimization, e.g. by using random sampling and token replacement strategies. Here, a core challenge is that these methods can suffer from sampling efficiency, causing long and energy consuming optimization time. As a remedy, we propose in this work 2D neural guided sampling. Thus, given a 2D representation of a quantum circuit, a neural network predicts groups of gates in the quantum circuit, which are likely reducible. Thus, it leads to a sampling prior which can heavily reduce the compute time for quantum circuit reduction. In several experiments, we demonstrate that our method is superior to results obtained from different qiskit or BQSKit optimization levels.
comment: 12 pages, 9 Figures
☆ Geopolitics, Geoeconomics and Risk:A Machine Learning Approach
We introduce a novel high-frequency daily panel dataset of both markets and news-based indicators -- including Geopolitical Risk, Economic Policy Uncertainty, Trade Policy Uncertainty, and Political Sentiment -- for 42 countries across both emerging and developed markets. Using this dataset, we study how sentiment dynamics shape sovereign risk, measured by Credit Default Swap (CDS) spreads, and evaluate their forecasting value relative to traditional drivers such as global monetary policy and market volatility. Our horse-race analysis of forecasting models demonstrates that incorporating news-based indicators significantly enhances predictive accuracy and enriches the analysis, with non-linear machine learning methods -- particularly Random Forests -- delivering the largest gains. Our analysis reveals that while global financial variables remain the dominant drivers of sovereign risk, geopolitical risk and economic policy uncertainty also play a meaningful role. Crucially, their effects are amplified through non-linear interactions with global financial conditions. Finally, we document pronounced regional heterogeneity, as certain asset classes and emerging markets exhibit heightened sensitivity to shocks in policy rates, global financial volatility, and geopolitical risk.
☆ Continuous Uniqueness and Novelty Metrics for Generative Modeling of Inorganic Crystals NeurIPS 2025
To address pressing scientific challenges such as climate change, increasingly sophisticated generative artificial intelligence models are being developed that can efficiently sample the large chemical space of possible functional materials. These models can quickly sample new chemical compositions paired with crystal structures. They are typically evaluated using uniqueness and novelty metrics, which depend on a chosen crystal distance function. However, the most prevalent distance function has four limitations: it fails to quantify the degree of similarity between compounds, cannot distinguish compositional difference and structural difference, lacks Lipschitz continuity against shifts in atomic coordinates, and results in a uniqueness metric that is not invariant against the permutation of generated samples. In this work, we propose using two continuous distance functions to evaluate uniqueness and novelty, which theoretically overcome these limitations. Our experiments show that these distances reveal insights missed by traditional distance functions, providing a more reliable basis for evaluating and comparing generative models for inorganic crystals.
comment: 13 pages (5 pages of main text), accepted to the AI4Mat workshop at NeurIPS 2025. See https://github.com/WMD-group/xtalmet for the code
☆ Robot Learning: A Tutorial
Robot learning is at an inflection point, driven by rapid advancements in machine learning and the growing availability of large-scale robotics data. This shift from classical, model-based methods to data-driven, learning-based paradigms is unlocking unprecedented capabilities in autonomous systems. This tutorial navigates the landscape of modern robot learning, charting a course from the foundational principles of Reinforcement Learning and Behavioral Cloning to generalist, language-conditioned models capable of operating across diverse tasks and even robot embodiments. This work is intended as a guide for researchers and practitioners, and our goal is to equip the reader with the conceptual understanding and practical tools necessary to contribute to developments in robot learning, with ready-to-use examples implemented in $\texttt{lerobot}$.
comment: Tutorial on Robot Learning using LeRobot, the end-to-end robot learning library developed by Hugging Face
☆ Cautious Weight Decay
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
☆ Enhanced Pre-training of Graph Neural Networks for Million-Scale Heterogeneous Graphs
In recent years, graph neural networks (GNNs) have facilitated the development of graph data mining. However, training GNNs requires sufficient labeled task-specific data, which is expensive and sometimes unavailable. To be less dependent on labeled data, recent studies propose to pre-train GNNs in a self-supervised manner and then apply the pre-trained GNNs to downstream tasks with limited labeled data. However, most existing methods are designed solely for homogeneous graphs (real-world graphs are mostly heterogeneous) and do not consider semantic mismatch (the semantic difference between the original data and the ideal data containing more transferable semantic information). In this paper, we propose an effective framework to pre-train GNNs on the large-scale heterogeneous graph. We first design a structure-aware pre-training task, which aims to capture structural properties in heterogeneous graphs. Then, we design a semantic-aware pre-training task to tackle the mismatch. Specifically, we construct a perturbation subspace composed of semantic neighbors to help deal with the semantic mismatch. Semantic neighbors make the model focus more on the general knowledge in the semantic space, which in turn assists the model in learning knowledge with better transferability. Finally, extensive experiments are conducted on real-world large-scale heterogeneous graphs to demonstrate the superiority of the proposed method over state-of-the-art baselines. Code available at https://github.com/sunshy-1/PHE.
comment: 26 pages
☆ Improving Generative Behavior Cloning via Self-Guidance and Adaptive Chunking NeurIPS25
Generative Behavior Cloning (GBC) is a simple yet effective framework for robot learning, particularly in multi-task settings. Recent GBC methods often employ diffusion policies with open-loop (OL) control, where actions are generated via a diffusion process and executed in multi-step chunks without replanning. While this approach has demonstrated strong success rates and generalization, its inherent stochasticity can result in erroneous action sampling, occasionally leading to unexpected task failures. Moreover, OL control suffers from delayed responses, which can degrade performance in noisy or dynamic environments. To address these limitations, we propose two novel techniques to enhance the consistency and reactivity of diffusion policies: (1) self-guidance, which improves action fidelity by leveraging past observations and implicitly promoting future-aware behavior; and (2) adaptive chunking, which selectively updates action sequences when the benefits of reactivity outweigh the need for temporal consistency. Extensive experiments show that our approach substantially improves GBC performance across a wide range of simulated and real-world robotic manipulation tasks. Our code is available at https://github.com/junhyukso/SGAC
comment: Accepted at NeurIPS25
☆ Towards Cross-Modal Error Detection with Tables and Images
Ensuring data quality at scale remains a persistent challenge for large organizations. Despite recent advances, maintaining accurate and consistent data is still complex, especially when dealing with multiple data modalities. Traditional error detection and correction methods tend to focus on a single modality, typically a table, and often miss cross-modal errors that are common in domains like e-Commerce and healthcare, where image, tabular, and text data co-exist. To address this gap, we take an initial step towards cross-modal error detection in tabular data, by benchmarking several methods. Our evaluation spans four datasets and five baseline approaches. Among them, Cleanlab, a label error detection framework, and DataScope, a data valuation method, perform the best when paired with a strong AutoML framework, achieving the highest F1 scores. Our findings indicate that current methods remain limited, particularly when applied to heavy-tailed real-world data, motivating further research in this area.
LiteVPNet: A Lightweight Network for Video Encoding Control in Quality-Critical Applications
In the last decade, video workflows in the cinema production ecosystem have presented new use cases for video streaming technology. These new workflows, e.g. in On-set Virtual Production, present the challenge of requiring precise quality control and energy efficiency. Existing approaches to transcoding often fall short of these requirements, either due to a lack of quality control or computational overhead. To fill this gap, we present a lightweight neural network (LiteVPNet) for accurately predicting Quantisation Parameters for NVENC AV1 encoders that achieve a specified VMAF score. We use low-complexity features, including bitstream characteristics, video complexity measures, and CLIP-based semantic embeddings. Our results demonstrate that LiteVPNet achieves mean VMAF errors below 1.2 points across a wide range of quality targets. Notably, LiteVPNet achieves VMAF errors within 2 points for over 87% of our test corpus, c.f. approx 61% with state-of-the-art methods. LiteVPNet's performance across various quality regions highlights its applicability for enhancing high-value content transport and streaming for more energy-efficient, high-quality media experiences.
comment: Accepted PCS 2025 Camera-Ready Version, 5 Pages
☆ Deep Attention-guided Adaptive Subsampling
Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a nondifferentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works, have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not inputadaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.
☆ Improved Central Limit Theorem and Bootstrap Approximations for Linear Stochastic Approximation
In this paper, we refine the Berry-Esseen bounds for the multivariate normal approximation of Polyak-Ruppert averaged iterates arising from the linear stochastic approximation (LSA) algorithm with decreasing step size. We consider the normal approximation by the Gaussian distribution with covariance matrix predicted by the Polyak-Juditsky central limit theorem and establish the rate up to order $n^{-1/3}$ in convex distance, where $n$ is the number of samples used in the algorithm. We also prove a non-asymptotic validity of the multiplier bootstrap procedure for approximating the distribution of the rescaled error of the averaged LSA estimator. We establish approximation rates of order up to $1/\sqrt{n}$ for the latter distribution, which significantly improves upon the previous results obtained by Samsonov et al. (2024).
☆ Constrained Sensing and Reliable State Estimation with Shallow Recurrent Decoders on a TRIGA Mark II Reactor
Shallow Recurrent Decoder networks are a novel data-driven methodology able to provide accurate state estimation in engineering systems, such as nuclear reactors. This deep learning architecture is a robust technique designed to map the temporal trajectories of a few sparse measures to the full state space, including unobservable fields, which is agnostic to sensor positions and able to handle noisy data through an ensemble strategy, leveraging the short training times and without the need for hyperparameter tuning. Following its application to a novel reactor concept, this work investigates the performance of Shallow Recurrent Decoders when applied to a real system. The underlying model is represented by a fluid dynamics model of the TRIGA Mark II research reactor; the architecture will use both synthetic temperature data coming from the numerical model and leveraging experimental temperature data recorded during a previous campaign. The objective of this work is, therefore, two-fold: 1) assessing if the architecture can reconstruct the full state of the system (temperature, velocity, pressure, turbulence quantities) given sparse data located in specific, low-dynamics channels and 2) assessing the correction capabilities of the architecture (that is, given a discrepancy between model and data, assessing if sparse measurements can provide some correction to the architecture output). As will be shown, the accurate reconstruction of every characteristic field, using both synthetic and experimental data, in real-time makes this approach suitable for interpretable monitoring and control purposes in the framework of a reactor digital twin.
☆ Pretraining in Actor-Critic Reinforcement Learning for Robot Motion Control ICLR 2026
The pretraining-finetuning paradigm has facilitated numerous transformative advancements in artificial intelligence research in recent years. However, in the domain of reinforcement learning (RL) for robot motion control, individual skills are often learned from scratch despite the high likelihood that some generalizable knowledge is shared across all task-specific policies belonging to a single robot embodiment. This work aims to define a paradigm for pretraining neural network models that encapsulate such knowledge and can subsequently serve as a basis for warm-starting the RL process in classic actor-critic algorithms, such as Proximal Policy Optimization (PPO). We begin with a task-agnostic exploration-based data collection algorithm to gather diverse, dynamic transition data, which is then used to train a Proprioceptive Inverse Dynamics Model (PIDM) through supervised learning. The pretrained weights are loaded into both the actor and critic networks to warm-start the policy optimization of actual tasks. We systematically validated our proposed method on seven distinct robot motion control tasks, showing significant benefits to this initialization strategy. Our proposed approach on average improves sample efficiency by 40.1% and task performance by 7.5%, compared to random initialization. We further present key ablation studies and empirical analyses that shed light on the mechanisms behind the effectiveness of our method.
comment: Submitted to ICLR 2026
☆ Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models
As users increasingly interact with large language models (LLMs) using private information, secure and encrypted communication becomes essential. Homomorphic encryption (HE) provides a principled solution by enabling computation directly on encrypted data. Although prior work has explored aspects of running LLMs under HE, the challenge of text generation, particularly next-token prediction, has received limited attention and remains a key obstacle to practical encrypted interaction. In this work, we propose a TSP-based token reordering strategy to address the difficulties of encrypted text generation, together with a post-processing step that further reduces approximation error. Theoretical analysis and experimental results demonstrate that our method prevents collapse, improves coherence in generated text, and preserves data privacy throughout. Overall, our contributions advance the feasibility of practical and privacy-preserving LLM inference.
comment: 34 pages
☆ Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions-through techniques such as reward shaping, entropy regularization, or curriculum learning-yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2T$ in the static-reward case.
☆ Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce fine-resolution maps that support decision-making in long-term water management.
☆ Deep SPI: Safe Policy Improvement via World Models
Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, "deep" analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on-policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE-57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.
comment: 10 pages main text, 17 pages appendix (excluding references)
☆ Learning Latent Energy-Based Models via Interacting Particle Langevin Dynamics
We develop interacting particle algorithms for learning latent variable models with energy-based priors. To do so, we leverage recent developments in particle-based methods for solving maximum marginal likelihood estimation (MMLE) problems. Specifically, we provide a continuous-time framework for learning latent energy-based models, by defining stochastic differential equations (SDEs) that provably solve the MMLE problem. We obtain a practical algorithm as a discretisation of these SDEs and provide theoretical guarantees for the convergence of the proposed algorithm. Finally, we demonstrate the empirical effectiveness of our method on synthetic and image datasets.
☆ DeepTrust: Multi-Step Classification through Dissimilar Adversarial Representations for Robust Android Malware Detection
Over the last decade, machine learning has been extensively applied to identify malicious Android applications. However, such approaches remain vulnerable against adversarial examples, i.e., examples that are subtly manipulated to fool a machine learning model into making incorrect predictions. This research presents DeepTrust, a novel metaheuristic that arranges flexible classifiers, like deep neural networks, into an ordered sequence where the final decision is made by a single internal model based on conditions activated in cascade. In the Robust Android Malware Detection competition at the 2025 IEEE Conference SaTML, DeepTrust secured the first place and achieved state-of-the-art results, outperforming the next-best competitor by up to 266% under feature-space evasion attacks. This is accomplished while maintaining the highest detection rate on non-adversarial malware and a false positive rate below 1%. The method's efficacy stems from maximizing the divergence of the learned representations among the internal models. By using classifiers inducing fundamentally dissimilar embeddings of the data, the decision space becomes unpredictable for an attacker. This frustrates the iterative perturbation process inherent to evasion attacks, enhancing system robustness without compromising accuracy on clean examples.
☆ General Fourier Feature Physics-Informed Extreme Learning Machine (GFF-PIELM) for High-Frequency PDEs
Conventional physics-informed extreme learning machine (PIELM) often faces challenges in solving partial differential equations (PDEs) involving high-frequency and variable-frequency behaviors. To address these challenges, we propose a general Fourier feature physics-informed extreme learning machine (GFF-PIELM). We demonstrate that directly concatenating multiple Fourier feature mappings (FFMs) and an extreme learning machine (ELM) network makes it difficult to determine frequency-related hyperparameters. Fortunately, we find an alternative to establish the GFF-PIELM in three main steps. First, we integrate a variation of FFM into ELM as the Fourier-based activation function, so there is still one hidden layer in the GFF-PIELM framework. Second, we assign a set of frequency coefficients to the hidden neurons, which enables ELM network to capture diverse frequency components of target solutions. Finally, we develop an innovative, straightforward initialization method for these hyperparameters by monitoring the distribution of ELM output weights. GFF-PIELM not only retains the high accuracy, efficiency, and simplicity of the PIELM framework but also inherits the ability of FFMs to effectively handle high-frequency problems. We carry out five case studies with a total of ten numerical examples to highlight the feasibility and validity of the proposed GFF-PIELM, involving high frequency, variable frequency, multi-scale behaviour, irregular boundary and inverse problems. Compared to conventional PIELM, the GFF-PIELM approach significantly improves predictive accuracy without additional cost in training time and architecture complexity. Our results confirm that that PIELM can be extended to solve high-frequency and variable-frequency PDEs with high accuracy, and our initialization strategy may further inspire advances in other physics-informed machine learning (PIML) frameworks.
☆ Multi-Action Self-Improvement for Neural Combinatorial Optimization
Self-improvement has emerged as a state-of-the-art paradigm in Neural Combinatorial Optimization (NCO), where models iteratively refine their policies by generating and imitating high-quality solutions. Despite strong empirical performance, existing methods face key limitations. Training is computationally expensive, as policy updates require sampling numerous candidate solutions per instance to extract a single expert trajectory. More fundamentally, these approaches fail to exploit the structure of combinatorial problems involving the coordination of multiple agents, such as vehicles in min-max routing or machines in scheduling. By supervising on single-action trajectories, they fail to exploit agent-permutation symmetries, where distinct sequences of actions yield identical solutions, hindering generalization and the ability to learn coordinated behavior. We address these challenges by extending self-improvement to operate over joint multi-agent actions. Our model architecture predicts complete agent-task assignments jointly at each decision step. To explicitly leverage symmetries, we employ a set-prediction loss, which supervises the policy on multiple expert assignments for any given state. This approach enhances sample efficiency and the model's ability to learn coordinated behavior. Furthermore, by generating multi-agent actions in parallel, it drastically accelerates the solution generation phase of the self-improvement loop. Empirically, we validate our method on several combinatorial problems, demonstrating consistent improvements in the quality of the final solution and a reduced generation latency compared to standard self-improvement.
☆ Heterogeneous RBCs via deep multi-agent reinforcement learning
Current macroeconomic models with agent heterogeneity can be broadly divided into two main groups. Heterogeneous-agent general equilibrium (GE) models, such as those based on Heterogeneous Agents New Keynesian (HANK) or Krusell-Smith (KS) approaches, rely on GE and 'rational expectations', somewhat unrealistic assumptions that make the models very computationally cumbersome, which in turn limits the amount of heterogeneity that can be modelled. In contrast, agent-based models (ABMs) can flexibly encompass a large number of arbitrarily heterogeneous agents, but typically require the specification of explicit behavioural rules, which can lead to a lengthy trial-and-error model-development process. To address these limitations, we introduce MARL-BC, a framework that integrates deep multi-agent reinforcement learning (MARL) with Real Business Cycle (RBC) models. We demonstrate that MARL-BC can: (1) recover textbook RBC results when using a single agent; (2) recover the results of the mean-field KS model using a large number of identical agents; and (3) effectively simulate rich heterogeneity among agents, a hard task for traditional GE approaches. Our framework can be thought of as an ABM if used with a variety of heterogeneous interacting agents, and can reproduce GE results in limit cases. As such, it is a step towards a synthesis of these often opposed modelling paradigms.
comment: 13 pages, 9 figures
☆ The Living Forecast: Evolving Day-Ahead Predictions into Intraday Reality
Accurate intraday forecasts are essential for power system operations, complementing day-ahead forecasts that gradually lose relevance as new information becomes available. This paper introduces a Bayesian updating mechanism that converts fully probabilistic day-ahead forecasts into intraday forecasts without retraining or re-inference. The approach conditions the Gaussian mixture output of a conditional variational autoencoder-based forecaster on observed measurements, yielding an updated distribution for the remaining horizon that preserves its probabilistic structure. This enables consistent point, quantile, and ensemble forecasts while remaining computationally efficient and suitable for real-time applications. Experiments on household electricity consumption and photovoltaic generation datasets demonstrate that the proposed method improves forecast accuracy up to 25% across likelihood-, sample-, quantile-, and point-based metrics. The largest gains occur in time steps with strong temporal correlation to observed data, and the use of pattern dictionary-based covariance structures further enhances performance. The results highlight a theoretically grounded framework for intraday forecasting in modern power systems.
☆ Tensor Logic: The Language of AI
Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP an Prolog lack scalability and support for learning. This paper proposes tensor logic, a language that solves these problems by unifying neural and symbolic AI at a fundamental level. The sole construct in tensor logic is the tensor equation, based on the observation that logical rules and Einstein summation are essentially the same operation, and all else can be reduced to them. I show how to elegantly implement key forms of neural, symbolic and statistical AI in tensor logic, including transformers, formal reasoning, kernel machines and graphical models. Most importantly, tensor logic makes new directions possible, such as sound reasoning in embedding space. This combines the scalability and learnability of neural networks with the reliability and transparency of symbolic reasoning, and is potentially a basis for the wider adoption of AI.
comment: 17 pages, 0 figures
☆ HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization
Low-Rank Adaptation (LoRA) has emerged as a widely used technique for adapting large language models (LLMs) to new domains, due to its modular design and broad availability on platforms such as HuggingFace. This availability has motivated efforts to reuse existing LoRAs for domain generalization. However, existing methods often rely on explicit task labels or additional training, which are impractical for deployment. Moreover, they typically activate a fixed number of entire LoRA modules, leading to parameter redundancy or insufficiency that degrade performance. In this paper, we propose \texttt{HiLoRA}, a training-free framework that performs adaptive hierarchical routing over LoRA pools. Drawing on structural properties of LoRA, we define rank-one components (ROCs), in which each rank parameter is regarded as an independent unit. For a given input sequence, \texttt{HiLoRA} first adaptively selects a subset of LoRAs and determines their ROC allocation based on Gaussian likelihoods at the sequence level. At the token level, it further refines routing by activating only the most informative ROCs. We further provide theoretical guarantees that \texttt{HiLoRA} selects the most relevant LoRAs with high probability. Extensive experiments show that \texttt{HiLoRA} achieves substantial improvements in domain generalization, with accuracy gains of up to {\small $55\%$} over state-of-the-art baselines, while maintaining comparable inference throughput.
☆ AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion
Visible-infrared image fusion is crucial in key applications such as autonomous driving and nighttime surveillance. Its main goal is to integrate multimodal information to produce enhanced images that are better suited for downstream tasks. Although deep learning based fusion methods have made significant progress, mainstream unsupervised approaches still face serious challenges in practical applications. Existing methods mostly rely on manually designed loss functions to guide the fusion process. However, these loss functions have obvious limitations. On one hand, the reference images constructed by existing methods often lack details and have uneven brightness. On the other hand, the widely used gradient losses focus only on gradient magnitude. To address these challenges, this paper proposes an angle-based perception framework for spatial-sensitive image fusion (AngularFuse). At first, we design a cross-modal complementary mask module to force the network to learn complementary information between modalities. Then, a fine-grained reference image synthesis strategy is introduced. By combining Laplacian edge enhancement with adaptive histogram equalization, reference images with richer details and more balanced brightness are generated. Last but not least, we introduce an angle-aware loss, which for the first time constrains both gradient magnitude and direction simultaneously in the gradient domain. AngularFuse ensures that the fused images preserve both texture intensity and correct edge orientation. Comprehensive experiments on the MSRS, RoadScene, and M3FD public datasets show that AngularFuse outperforms existing mainstream methods with clear margin. Visual comparisons further confirm that our method produces sharper and more detailed results in challenging scenes, demonstrating superior fusion capability.
comment: For the first time, angle-based perception was introduced into the multi-modality image fusion task
☆ FedMMKT:Co-Enhancing a Server Text-to-Image Model and Client Task Models in Multi-Modal Federated Learning
Text-to-Image (T2I) models have demonstrated their versatility in a wide range of applications. However, adaptation of T2I models to specialized tasks is often limited by the availability of task-specific data due to privacy concerns. On the other hand, harnessing the power of rich multimodal data from modern mobile systems and IoT infrastructures presents a great opportunity. This paper introduces Federated Multi-modal Knowledge Transfer (FedMMKT), a novel framework that enables co-enhancement of a server T2I model and client task-specific models using decentralized multimodal data without compromising data privacy.
Diffusion Models for Reinforcement Learning: Foundations, Taxonomy, and Development
Diffusion Models (DMs), as a leading class of generative models, offer key advantages for reinforcement learning (RL), including multi-modal expressiveness, stable training, and trajectory-level planning. This survey delivers a comprehensive and up-to-date synthesis of diffusion-based RL. We first provide an overview of RL, highlighting its challenges, and then introduce the fundamental concepts of DMs, investigating how they are integrated into RL frameworks to address key challenges in this research field. We establish a dual-axis taxonomy that organizes the field along two orthogonal dimensions: a function-oriented taxonomy that clarifies the roles DMs play within the RL pipeline, and a technique-oriented taxonomy that situates implementations across online versus offline learning regimes. We also provide a comprehensive examination of this progression from single-agent to multi-agent domains, thereby forming several frameworks for DM-RL integration and highlighting their practical utility. Furthermore, we outline several categories of successful applications of diffusion-based RL across diverse domains, discuss open research issues of current methodologies, and highlight key directions for future research to advance the field. Finally, we summarize the survey to identify promising future development directions. We are actively maintaining a GitHub repository (https://github.com/ChangfuXu/D4RL-FTD) for papers and other related resources to apply DMs for RL.
comment: Under Review
☆ Optimal Regularization for Performative Learning
In performative learning, the data distribution reacts to the deployed model - for example, because strategic users adapt their features to game it - which creates a more complex dynamic than in classical supervised learning. One should thus not only optimize the model for the current data but also take into account that the model might steer the distribution in a new direction, without knowing the exact nature of the potential shift. We explore how regularization can help cope with performative effects by studying its impact in high-dimensional ridge regression. We show that, while performative effects worsen the test risk in the population setting, they can be beneficial in the over-parameterized regime where the number of features exceeds the number of samples. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect. We illustrate this finding through empirical evaluations of the optimal regularization parameter on both synthetic and real-world datasets.
☆ MoRA: On-the-fly Molecule-aware Low-Rank Adaptation Framework for LLM-based Multi-Modal Molecular Assistant
Effectively integrating molecular graph structures with Large Language Models (LLMs) is a key challenge in drug discovery. Most existing multi-modal alignment methods typically process these structures by fine-tuning the LLM or adding a static adapter simultaneously. However, these approaches have two main limitations: (1) it optimizes a shared parameter space across all molecular inputs, limiting the model's ability to capture instance-specific structural features; and (2) fine-tuning the LLM for molecular tasks can lead to catastrophic forgetting, undermining its general reasoning capabilities. In this paper, instead of static task-oriented adaptation, we propose an instance-specific parameter space alignment approach for each molecule on-the-fly. To this end, we introduce Molecule-aware Low-Rank Adaptation (MoRA) that produces a unique set of low-rank adaptation weights for each input molecular graph. These weights are then dynamically injected into a frozen LLM, allowing the model to adapt its reasoning to the structure of each molecular input, while preserving the LLM's core knowledge. Extensive experiments demonstrate that on key molecular tasks, such as chemical reaction prediction and molecular captioning, MoRA's instance-specific dynamic adaptation outperforms statically adapted baselines, including a 14.1% relative improvement in reaction prediction exact match and a 22% reduction in error for quantum property prediction. The code is available at https://github.com/jk-sounds/MoRA.
☆ A Gradient Guided Diffusion Framework for Chance Constrained Programming
Chance constrained programming (CCP) is a powerful framework for addressing optimization problems under uncertainty. In this paper, we introduce a novel Gradient-Guided Diffusion-based Optimization framework, termed GGDOpt, which tackles CCP through three key innovations. First, GGDOpt accommodates a broad class of CCP problems without requiring the knowledge of the exact distribution of uncertainty-relying solely on a set of samples. Second, to address the nonconvexity of the chance constraints, it reformulates the CCP as a sampling problem over the product of two distributions: an unknown data distribution supported on a nonconvex set and a Boltzmann distribution defined by the objective function, which fully leverages both first- and second-order gradient information. Third, GGDOpt has theoretical convergence guarantees and provides practical error bounds under mild assumptions. By progressively injecting noise during the forward diffusion process to convexify the nonconvex feasible region, GGDOpt enables guided reverse sampling to generate asymptotically optimal solutions. Experimental results on synthetic datasets and a waveform design task in wireless communications demonstrate that GGDOpt outperforms existing methods in both solution quality and stability with nearly 80% overhead reduction.
☆ Unveiling the Vulnerability of Graph-LLMs: An Interpretable Multi-Dimensional Adversarial Attack on TAGs
Graph Neural Networks (GNNs) have become a pivotal framework for modeling graph-structured data, enabling a wide range of applications from social network analysis to molecular chemistry. By integrating large language models (LLMs), text-attributed graphs (TAGs) enhance node representations with rich textual semantics, significantly boosting the expressive power of graph-based learning. However, this sophisticated synergy introduces critical vulnerabilities, as Graph-LLMs are susceptible to adversarial attacks on both their structural topology and textual attributes. Although specialized attack methods have been designed for each of these aspects, no work has yet unified them into a comprehensive approach. In this work, we propose the Interpretable Multi-Dimensional Graph Attack (IMDGA), a novel human-centric adversarial attack framework designed to orchestrate multi-level perturbations across both graph structure and textual features. IMDGA utilizes three tightly integrated modules to craft attacks that balance interpretability and impact, enabling a deeper understanding of Graph-LLM vulnerabilities. Through rigorous theoretical analysis and comprehensive empirical evaluations on diverse datasets and architectures, IMDGA demonstrates superior interpretability, attack effectiveness, stealthiness, and robustness compared to existing methods. By exposing critical weaknesses in TAG representation learning, this work uncovers a previously underexplored semantic dimension of vulnerability in Graph-LLMs, offering valuable insights for improving their resilience. Our code and resources are publicly available at https://anonymous.4open.science/r/IMDGA-7289.
comment: 12 pages, 4 figures
☆ HoneyBee: Data Recipes for Vision-Language Reasoners
Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.
comment: 32 pages
☆ Hierarchical Koopman Diffusion: Fast Generation with Interpretable Diffusion Trajectory NeurIPS 2025
Diffusion models have achieved impressive success in high-fidelity image generation but suffer from slow sampling due to their inherently iterative denoising process. While recent one-step methods accelerate inference by learning direct noise-to-image mappings, they sacrifice the interpretability and fine-grained control intrinsic to diffusion dynamics, key advantages that enable applications like editable generation. To resolve this dichotomy, we introduce \textbf{Hierarchical Koopman Diffusion}, a novel framework that achieves both one-step sampling and interpretable generative trajectories. Grounded in Koopman operator theory, our method lifts the nonlinear diffusion dynamics into a latent space where evolution is governed by globally linear operators, enabling closed-form trajectory solutions. This formulation not only eliminates iterative sampling but also provides full access to intermediate states, allowing manual intervention during generation. To model the multi-scale nature of images, we design a hierarchical architecture that disentangles generative dynamics across spatial resolutions via scale-specific Koopman subspaces, capturing coarse-to-fine details systematically. We empirically show that the Hierarchical Koopman Diffusion not only achieves competitive one-step generation performance but also provides a principled mechanism for interpreting and manipulating the generative process through spectral analysis. Our framework bridges the gap between fast sampling and interpretability in diffusion models, paving the way for explainable image synthesis in generative modeling.
comment: NeurIPS 2025
☆ DE3S: Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification
Early time-series classification (ETSC) in medical applications is crucial for time-sensitive scenarios such as sepsis prediction in intensive care units (ICUs), where a large number of deaths are caused by delayed prediction. ETSC can significantly improve ICU resource utilization efficiency and healthcare precision. However, it faces conflicting goals of accuracy and earliness, with existing methods often trading one for the other, struggling to capture subtle early-stage patterns due to weak initial signals and class imbalance. The key to solve these challenges is to find shapelets, which are discriminative subsequences (or shapes) with high interpretability in time-series classification. This paper proposes Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification (DE3S), which introduces a novel Dual-Enhanced Soft-Shape Learning framework to figure out shapelets precisely through three innovations: (1) a comprehensive dual-enhancement strategy combines traditional temporal augmentation with attention-based global temporal enhancement for robust representation learning, (2) an attention-score-based soft shapelet sparsification mechanism dynamically preserves discriminative patterns while aggregating less important shapelets into representative tokens, and (3) a dual-path Mixture of Experts Network (MoE) and Inception modules fusion architecture where MoE performs local learning within shapelets and multi-scale Inception modules capture global patterns across shapelets. The framework employs weighted cross-entropy loss for class imbalance handling and demonstrates robustness on subject-consistency datasets. Extensive experiments on six real-world medical datasets show state-of-the-art performance, with ablation studies confirming component efficacy.
comment: Accepted to IEEE BIBM 2025
☆ DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation
Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on https://anonymous.4open.science/w/DiSTAR_demo.
☆ Revisiting Meta-Learning with Noisy Labels: Reweighting Dynamics and Theoretical Guarantees
Learning with noisy labels remains challenging because over-parameterized networks memorize corrupted supervision. Meta-learning-based sample reweighting mitigates this by using a small clean subset to guide training, yet its behavior and training dynamics lack theoretical understanding. We provide a rigorous theoretical analysis of meta-reweighting under label noise and show that its training trajectory unfolds in three phases: (i) an alignment phase that amplifies examples consistent with a clean subset and suppresses conflicting ones; (ii) a filtering phase driving noisy example weights toward zero until the clean subset loss plateaus; and (iii) a post-filtering phase in which noise filtration becomes perturbation-sensitive. The mechanism is a similarity-weighted coupling between training and clean subset signals together with clean subset training loss contraction; in the post-filtering regime where the clean-subset loss is sufficiently small, the coupling term vanishes and meta-reweighting loses discriminatory power. Guided by this analysis, we propose a lightweight surrogate for meta-reweighting that integrates mean-centering, row shifting, and label-signed modulation, yielding more stable performance while avoiding expensive bi-level optimization. Across synthetic and real noisy-label benchmarks, our method consistently outperforms strong reweighting/selection baselines.
☆ Controllable Collision Scenario Generation via Collision Pattern Prediction ICRA
Evaluating the safety of autonomous vehicles (AVs) requires diverse, safety-critical scenarios, with collisions being especially important yet rare and unsafe to collect in the real world. Therefore, the community has been focusing on generating safety-critical scenarios in simulation. However, controlling attributes such as collision type and time-to-accident (TTA) remains challenging. We introduce a new task called controllable collision scenario generation, where the goal is to produce trajectories that realize a user-specified collision type and TTA, to investigate the feasibility of automatically generating desired collision scenarios. To support this task, we present COLLIDE, a large-scale collision scenario dataset constructed by transforming real-world driving logs into diverse collisions, balanced across five representative collision types and different TTA intervals. We propose a framework that predicts Collision Pattern, a compact and interpretable representation that captures the spatial configuration of the ego and the adversarial vehicles at impact, before rolling out full adversarial trajectories. Experiments show that our approach outperforms strong baselines in both collision rate and controllability. Furthermore, generated scenarios consistently induce higher planner failure rates, revealing limitations of existing planners. We demonstrate that these scenarios fine-tune planners for robustness improvements, contributing to safer AV deployment in different collision scenarios.
comment: 8 pages, 3 figures. Submitted to IEEE International Conference on Robotics and Automation (ICRA) 2026
☆ Learning Mean-Field Games through Mean-Field Actor-Critic Flow
We propose the Mean-Field Actor-Critic (MFAC) flow, a continuous-time learning dynamics for solving mean-field games (MFGs), combining techniques from reinforcement learning and optimal transport. The MFAC framework jointly evolves the control (actor), value function (critic), and distribution components through coupled gradient-based updates governed by partial differential equations (PDEs). A central innovation is the Optimal Transport Geodesic Picard (OTGP) flow, which drives the distribution toward equilibrium along Wasserstein-2 geodesics. We conduct a rigorous convergence analysis using Lyapunov functionals and establish global exponential convergence of the MFAC flow under a suitable timescale. Our results highlight the algorithmic interplay among actor, critic, and distribution components. Numerical experiments illustrate the theoretical findings and demonstrate the effectiveness of the MFAC framework in computing MFG equilibria.
☆ Self-Verifying Reflection Helps Transformers with CoT Reasoning NeurIPS2025
Advanced large language models (LLMs) frequently reflect in reasoning chain-of-thoughts (CoTs), where they self-verify the correctness of current solutions and explore alternatives. However, given recent findings that LLMs detect limited errors in CoTs, how reflection contributes to empirical improvements remains unclear. To analyze this issue, in this paper, we present a minimalistic reasoning framework to support basic self-verifying reflection for small transformers without natural language, which ensures analytic clarity and reduces the cost of comprehensive experiments. Theoretically, we prove that self-verifying reflection guarantees improvements if verification errors are properly bounded. Experimentally, we show that tiny transformers, with only a few million parameters, benefit from self-verification in both training and reflective execution, reaching remarkable LLM-level performance in integer multiplication and Sudoku. Similar to LLM results, we find that reinforcement learning (RL) improves in-distribution performance and incentivizes frequent reflection for tiny transformers, yet RL mainly optimizes shallow statistical patterns without faithfully reducing verification errors. In conclusion, integrating generative transformers with discriminative verification inherently facilitates CoT reasoning, regardless of scaling and natural language.
comment: Accepted by NeurIPS2025
☆ Follow-the-Perturbed-Leader for Decoupled Bandits: Best-of-Both-Worlds and Practicality
We study the decoupled multi-armed bandit (MAB) problem, where the learner selects one arm for exploration and one arm for exploitation in each round. The loss of the explored arm is observed but not counted, while the loss of the exploited arm is incurred without being observed. We propose a policy within the Follow-the-Perturbed-Leader (FTPL) framework using Pareto perturbations. Our policy achieves (near-)optimal regret regardless of the environment, i.e., Best-of-Both-Worlds (BOBW): constant regret in the stochastic regime, improving upon the optimal bound of the standard MABs, and minimax optimal regret in the adversarial regime. Moreover, the practicality of our policy stems from avoiding both the convex optimization step required by the previous BOBW policy, Decoupled-Tsallis-INF (Rouyer & Seldin, 2020), and the resampling step that is typically necessary in FTPL. Consequently, it achieves substantial computational improvement, about $20$ times faster than Decoupled-Tsallis-INF, while also demonstrating better empirical performance in both regimes. Finally, we empirically show that our approach outperforms a pure exploration policy, and that naively combining a pure exploration with a standard exploitation policy is suboptimal.
comment: Preprint, 29 pages
☆ Probabilistic Super-Resolution for Urban Micrometeorology via a Schrödinger Bridge
This study employs a neural network that represents the solution to a Schr\"odinger bridge problem to perform super-resolution of 2-m temperature in an urban area. Schr\"odinger bridges generally describe transformations between two data distributions based on diffusion processes. We use a specific Schr\"odinger-bridge model (SM) that directly transforms low-resolution data into high-resolution data, unlike denoising diffusion probabilistic models (simply, diffusion models; DMs) that generate high-resolution data from Gaussian noise. Low-resolution and high-resolution data were obtained from separate numerical simulations with a physics-based model under common initial and boundary conditions. Compared with a DM, the SM attains comparable accuracy at one-fifth the computational cost, requiring 50 neural-network evaluations per datum for the DM and only 10 for the SM. Furthermore, high-resolution samples generated by the SM exhibit larger variance, implying superior uncertainty quantification relative to the DM. Owing to the reduced computational cost of the SM, our results suggest the feasibility of real-time ensemble micrometeorological prediction using SM-based super-resolution.
☆ Budget-constrained Active Learning to Effectively De-censor Survival Data
Standard supervised learners attempt to learn a model from a labeled dataset. Given a small set of labeled instances, and a pool of unlabeled instances, a budgeted learner can use its given budget to pay to acquire the labels of some unlabeled instances, which it can then use to produce a model. Here, we explore budgeted learning in the context of survival datasets, which include (right) censored instances, where we know only a lower bound on an instance's time-to-event. Here, that learner can pay to (partially) label a censored instance -- e.g., to acquire the actual time for an instance [perhaps go from (3 yr, censored) to (7.2 yr, uncensored)], or other variants [e.g., learn about one more year, so go from (3 yr, censored) to either (4 yr, censored) or perhaps (3.2 yr, uncensored)]. This serves as a model of real world data collection, where follow-up with censored patients does not always lead to uncensoring, and how much information is given to the learner model during data collection is a function of the budget and the nature of the data itself. We provide both experimental and theoretical results for how to apply state-of-the-art budgeted learning algorithms to survival data and the respective limitations that exist in doing so. Our approach provides bounds and time complexity asymptotically equivalent to the standard active learning method BatchBALD. Moreover, empirical analysis on several survival tasks show that our model performs better than other potential approaches on several benchmarks.
☆ Fairness-Constrained Optimization Attack in Federated Learning
Federated learning (FL) is a privacy-preserving machine learning technique that facilitates collaboration among participants across demographics. FL enables model sharing, while restricting the movement of data. Since FL provides participants with independence over their training data, it becomes susceptible to poisoning attacks. Such collaboration also propagates bias among the participants, even unintentionally, due to different data distribution or historical bias present in the data. This paper proposes an intentional fairness attack, where a client maliciously sends a biased model, by increasing the fairness loss while training, even considering homogeneous data distribution. The fairness loss is calculated by solving an optimization problem for fairness metrics such as demographic parity and equalized odds. The attack is insidious and hard to detect, as it maintains global accuracy even after increasing the bias. We evaluate our attack against the state-of-the-art Byzantine-robust and fairness-aware aggregation schemes over different datasets, in various settings. The empirical results demonstrate the attack efficacy by increasing the bias up to 90\%, even in the presence of a single malicious client in the FL system.
comment: To appear in IEEE TrustCom 2025
☆ Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration NeurIPS25
Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.
comment: NeurIPS25
☆ nuGPR: GPU-Accelerated Gaussian Process Regression with Iterative Algorithms and Low-Rank Approximations
Gaussian Process Regression (GPR) is an important type of supervised machine learning model with inherent uncertainty measure in its predictions. We propose a new framework, nuGPR, to address the well-known challenge of high computation cost associated with GPR training. Our framework includes several ideas from numerical linear algebra to reduce the amount of computation in key steps of GPR, and we combine them to establish an end-to-end training algorithm. Specifically, we leverage the preconditioned conjugate gradient method to accelerate the convergence of the linear solves required in GPR. We exploit clustering in the input data to identify block-diagonal structure of the covariance matrix and subsequently construct low-rank approximations of the off-diagonal blocks. These enhancements significantly reduce the time and space complexity of our computations. In addition, unlike other frameworks that rely on exact differentiation, we employ numerical gradients to optimize the hyperparameters of our GPR model, further reducing the training cost by eliminating the need for backpropagation. Lastly, we leverage the CUDA Toolkit to efficiently parallelize the training procedure on NVIDIA GPUs. As a result, nuGPR reduces total training time by up to 2x and peak memory consumption by up to 12x on various synthetic and real-world datasets when compared to the best existing GPU-based GPR implementation.
comment: 22 pages, 6 figures, published in SIAM Journal on Scientific Computing, E-print available at: https://epubs.siam.org/eprint/5CF5CKX49Y4FUQXZFHCN/full
☆ Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on https://github.com/Pre-Control/pre-control
☆ Locket: Robust Feature-Locking Technique for Language Models
Chatbot providers (e.g., OpenAI) rely on tiered subscription schemes to generate revenue, offering basic models for free users, and advanced models for paying subscribers. However, a finer-grained pay-to-unlock scheme for premium features (e.g., math, coding) is thought to be more economically viable for the providers. Such a scheme requires a feature-locking technique (FLoTE) which is (i) effective in refusing locked features, (ii) utility-preserving for unlocked features, (iii) robust against evasion or unauthorized credential sharing, and (iv) scalable to multiple features and users. However, existing FLoTEs (e.g., password-locked models) are not robust or scalable. We present Locket, the first robust and scalable FLoTE to enable pay-to-unlock schemes. Locket uses a novel merging approach to attach adapters to an LLM for refusing unauthorized features. Our comprehensive evaluation shows that Locket is effective ($100$% refusal on locked features), utility-preserving ($\leq 7$% utility degradation in unlocked features), robust ($\leq 5$% attack success rate), and scales to multiple features and clients.
comment: 12 pages, 3 figures
☆ Chimera: State Space Models Beyond Sequences
Transformer-based deep learning methods have become the standard approach for modeling diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires inductive biases--such as position embeddings in sequences and images, or random walks in graphs--to incorporate topology. However, designing such task-specific biases requires significant effort and can introduce side effects that hinder generalization. We introduce Chimera, a unified model that directly incorporates data topology in a principled way, removing the need for domain-specific biases. The key idea is that state space models--which naturally do not require position embeddings--can be generalized to capture any graph topology. Our experiments show that Chimera achieves strong performance across language, vision, and graph domains, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark. We further propose algorithmic optimizations to improve Chimera's efficiency: (1) for Directed Acyclic Graphs, Chimera can be implemented as a linear-time recurrence; (2) for general graphs, a simple mathematical relaxation achieves Transformer's quadratic complexity without domain-specific heuristics. These results validate Chimera's core contribution and support the idea that data topology is a powerful inductive bias across modalities.
comment: Published in TMLR (October 2025); 22 Pages, 6 Figures, 11 Tables
☆ Rethinking the Role of Dynamic Sparse Training for Scalable Deep Reinforcement Learning
Scaling neural networks has driven breakthrough advances in machine learning, yet this paradigm fails in deep reinforcement learning (DRL), where larger models often degrade performance due to unique optimization pathologies such as plasticity loss. While recent works show that dynamically adapting network topology during training can mitigate these issues, existing studies have three critical limitations: (1) applying uniform dynamic training strategies across all modules despite encoder, critic, and actor following distinct learning paradigms, (2) focusing evaluation on basic architectures without clarifying the relative importance and interaction between dynamic training and architectural improvements, and (3) lacking systematic comparison between different dynamic approaches including sparse-to-sparse, dense-to-sparse, and sparse-to-dense. Through comprehensive investigation across modules and architectures, we reveal that dynamic sparse training strategies provide module-specific benefits that complement the primary scalability foundation established by architectural improvements. We finally distill these insights into Module-Specific Training (MST), a practical framework that further exploits the benefits of architectural improvements and demonstrates substantial scalability gains across diverse RL algorithms without algorithmic modifications.
☆ H4G: Unlocking Faithful Inference for Zero-Shot Graph Learning in Hyperbolic Space
Text-attributed graphs are widely used across domains, offering rich opportunities for zero-shot learning via graph-text alignment. However, existing methods struggle with tasks requiring fine-grained pattern recognition, particularly on heterophilic graphs. Through empirical and theoretical analysis, we identify an \textbf{over-abstraction problem}: current approaches operate at excessively large hyperbolic radii, compressing multi-scale structural information into uniform high-level abstractions. This abstraction-induced information loss obscures critical local patterns essential for accurate predictions. By analyzing embeddings in hyperbolic space, we demonstrate that optimal graph learning requires \textbf{faithful preservation} of fine-grained structural details, better retained by representations positioned closer to the origin. To address this, we propose \textbf{H4G}, a framework that systematically reduces embedding radii using learnable block-diagonal scaling matrices and M\"obius matrix multiplication. This approach restores access to fine-grained patterns while maintaining global receptive ability with minimal computational overhead. Experiments show H4G achieves state-of-the-art zero-shot performance with \textbf{12.8\%} improvement on heterophilic graphs and \textbf{8.4\%} on homophilic graphs, confirming that radius reduction enables faithful multi-scale representation for advancing zero-shot graph learning.
☆ One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
comment: Project page: https://onelife-worldmodel.github.io/; 39 pages
☆ GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs
Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: \textbf{Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures?} We introduce \textbf{GraphShaper}, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47\% accuracy improvements on citation networks and 7.63\% on social networks in zero-shot settings.
☆ FedLoDrop: Federated LoRA with Dropout for Generalized LLM Fine-tuning
Fine-tuning (FT) large language models (LLMs) is crucial for adapting general-purpose models to specific tasks, enhancing accuracy and relevance with minimal resources. To further enhance generalization ability while reducing training costs, this paper proposes Federated LoRA with Dropout (FedLoDrop), a new framework that applies dropout to the rows and columns of the trainable matrix in Federated LoRA. A generalization error bound and convergence analysis under sparsity regularization are obtained, which elucidate the fundamental trade-off between underfitting and overfitting. The error bound reveals that a higher dropout rate increases model sparsity, thereby lowering the upper bound of pointwise hypothesis stability (PHS). While this reduces the gap between empirical and generalization errors, it also incurs a higher empirical error, which, together with the gap, determines the overall generalization error. On the other hand, though dropout reduces communication costs, deploying FedLoDrop at the network edge still faces challenges due to limited network resources. To address this issue, an optimization problem is formulated to minimize the upper bound of the generalization error, by jointly optimizing the dropout rate and resource allocation subject to the latency and per-device energy consumption constraints. To solve this problem, a branch-and-bound (B\&B)-based method is proposed to obtain its globally optimal solution. Moreover, to reduce the high computational complexity of the B\&B-based method, a penalized successive convex approximation (P-SCA)-based algorithm is proposed to efficiently obtain its high-quality suboptimal solution. Finally, numerical results demonstrate the effectiveness of the proposed approach in mitigating overfitting and improving the generalization capability.
☆ Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory
We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases, linearly correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.
comment: 33 pages, 21 figures
☆ Influence Dynamics and Stagewise Data Attribution
Current training data attribution (TDA) methods treat the influence one sample has on another as static, but neural networks learn in distinct stages that exhibit changing patterns of influence. In this work, we introduce a framework for stagewise data attribution grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first validate these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model's progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.
comment: 28 pages, 15 figures
☆ MEASURE: Multi-scale Minimal Sufficient Representation Learning for Domain Generalization in Sleep Staging
Deep learning-based automatic sleep staging has significantly advanced in performance and plays a crucial role in the diagnosis of sleep disorders. However, those models often struggle to generalize on unseen subjects due to variability in physiological signals, resulting in degraded performance in out-of-distribution scenarios. To address this issue, domain generalization approaches have recently been studied to ensure generalized performance on unseen domains during training. Among those techniques, contrastive learning has proven its validity in learning domain-invariant features by aligning samples of the same class across different domains. Despite its potential, many existing methods are insufficient to extract adequately domain-invariant representations, as they do not explicitly address domain characteristics embedded within the unshared information across samples. In this paper, we posit that mitigating such domain-relevant attributes-referred to as excess domain-relevant information-is key to bridging the domain gap. However, the direct strategy to mitigate the domain-relevant attributes often overfits features at the high-level information, limiting their ability to leverage the diverse temporal and spectral information encoded in the multiple feature levels. To address these limitations, we propose a novel MEASURE (Multi-scalE minimAl SUfficient Representation lEarning) framework, which effectively reduces domain-relevant information while preserving essential temporal and spectral features for sleep stage classification. In our exhaustive experiments on publicly available sleep staging benchmark datasets, SleepEDF-20 and MASS, our proposed method consistently outperformed state-of-the-art methods. Our code is available at : https://github.com/ku-milab/Measure
comment: 12 page, 7 figures, uses IEEE.sty
☆ AI Agents as Universal Task Solvers
AI reasoning agents are already able to solve a variety of tasks by deploying tools, simulating outcomes of multiple hypotheses and reflecting on them. In doing so, they perform computation, although not in the classical sense -- there is no program being executed. Still, if they perform computation, can AI agents be universal? Can chain-of-thought reasoning solve any computable task? How does an AI Agent learn to reason? Is it a matter of model size? Or training dataset size? In this work, we reinterpret the role of learning in the context of AI Agents, viewing them as compute-capable stochastic dynamical systems, and highlight the role of time in a foundational principle for learning to reason. In doing so, we propose a shift from classical inductive learning to transductive learning -- where the objective is not to approximate the distribution of past data, but to capture their algorithmic structure to reduce the time needed to find solutions to new tasks. Transductive learning suggests that, counter to Shannon's theory, a key role of information in learning is about reduction of time rather than reconstruction error. In particular, we show that the optimal speed-up that a universal solver can achieve using past data is tightly related to their algorithmic information. Using this, we show a theoretical derivation for the observed power-law scaling of inference time versus training time. We then show that scaling model size can lead to behaviors that, while improving accuracy on benchmarks, fail any reasonable test of intelligence, let alone super-intelligence: In the limit of infinite space and time, large models can behave as savants, able to brute-force through any task without any insight. Instead, we argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.
☆ Your VAR Model is Secretly an Efficient and Explainable Generative Classifier
Generative classifiers, which leverage conditional generative models for classification, have recently demonstrated desirable properties such as robustness to distribution shifts. However, recent progress in this area has been largely driven by diffusion-based models, whose substantial computational cost severely limits scalability. This exclusive focus on diffusion-based methods has also constrained our understanding of generative classifiers. In this work, we propose a novel generative classifier built on recent advances in visual autoregressive (VAR) modeling, which offers a new perspective for studying generative classifiers. To further enhance its performance, we introduce the Adaptive VAR Classifier$^+$ (A-VARC$^+$), which achieves a superior trade-off between accuracy and inference speed, thereby significantly improving practical applicability. Moreover, we show that the VAR-based method exhibits fundamentally different properties from diffusion-based methods. In particular, due to its tractable likelihood, the VAR-based classifier enables visual explainability via token-wise mutual information and demonstrates inherent resistance to catastrophic forgetting in class-incremental learning tasks.
☆ MIARec: Mutual-influence-aware Heterogeneous Network Embedding for Scientific Paper Recommendation
With the rapid expansion of scientific literature, scholars increasingly demand precise and high-quality paper recommendations. Among various recommendation methodologies, graph-based approaches have garnered attention by effectively exploiting the structural characteristics inherent in scholarly networks. However, these methods often overlook the asymmetric academic influence that is prevalent in scholarly networks when learning graph representations. To address this limitation, this study proposes the Mutual-Influence-Aware Recommendation (MIARec) model, which employs a gravity-based approach to measure the mutual academic influence between scholars and incorporates this influence into the feature aggregation process during message propagation in graph representation learning. Additionally, the model utilizes a multi-channel aggregation method to capture both individual embeddings of distinct single relational sub-networks and their interdependent embeddings, thereby enabling a more comprehensive understanding of the heterogeneous scholarly network. Extensive experiments conducted on real-world datasets demonstrate that the MIARec model outperforms baseline models across three primary evaluation metrics, indicating its effectiveness in scientific paper recommendation tasks.
☆ Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models
Recent advancements in large language models (LLMs) have shown strong performance in natural language understanding and generation tasks. However, LLMs continue to encounter challenges with hallucinations, where models generate plausible but incorrect information. While several factors contribute to hallucinations, the impact of ill-formed prompts, prompts with ambiguous wording, incorrect grammar, or incomplete information, was relatively under explored. To address this, we introduce Multi-stage Prompt Refinement (MPR), a framework designed to systematically improve these ill-formed prompts across multiple stages. Each stage addresses specific errors such as punctuation, typographical mistakes, and misuse of key terms, using small language models (SLMs) fine-tuned for these tasks. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Experimental results on hallucination benchmarks show that prompts refined by MPR achieve over an 85~\% win rate compared to their original forms, demonstrating its effectiveness in reducing hallucinations and improving LLM output accuracy. Interestingly, we reveal that MPR can be combined with existing post-hoc hallucination mitigation frameworks, further enhancing its versatility. MPR provides a lightweight and adaptable solution for enhancing LLM reliability across various domains.
comment: 22 pages, 6 figures
☆ CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement
Recent advancements in large language models (LLMs) highlight their fluency in generating responses to diverse prompts. However, these models sometimes generate plausible yet incorrect ``hallucinated" facts, undermining trust. A frequent but often overlooked cause of such errors is the use of poorly structured or vague prompts by users, leading LLMs to base responses on assumed rather than actual intentions. To mitigate hallucinations induced by these ill-formed prompts, we introduce Curative Prompt Refinement (CPR), a plug-and-play framework for curative prompt refinement that 1) cleans ill-formed prompts, and 2) generates additional informative task descriptions to align the intention of the user and the prompt using a fine-tuned small language model. When applied to language models, we discover that CPR significantly increases the quality of generation while also mitigating hallucination. Empirical studies show that prompts with CPR applied achieves over a 90\% win rate over the original prompts without any external knowledge.
comment: 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 7 pages, 2 figures
☆ Mamaba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning
Mamba, a recently proposed linear-time sequence model, has attracted significant attention for its computational efficiency and strong empirical performance. However, a rigorous theoretical understanding of its underlying mechanisms remains limited. In this work, we provide a theoretical analysis of Mamba's in-context learning (ICL) capability by focusing on tasks defined by low-dimensional nonlinear target functions. Specifically, we study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$, which depends on only a single relevant direction $\boldsymbol{\beta}$, referred to as feature. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning, extracting the relevant direction directly from context examples. Consequently, we establish a test-time sample complexity that improves upon linear Transformers -- analyzed to behave like kernel methods -- and is comparable to nonlinear Transformers, which have been shown to surpass the Correlational Statistical Query (CSQ) lower bound and achieve near information-theoretically optimal rate in previous works. Our analysis reveals the crucial role of the nonlinear gating mechanism in Mamba for feature extraction, highlighting it as the fundamental driver behind Mamba's ability to achieve both computational efficiency and high performance.
comment: 34 pages
☆ Randomness and Interpolation Improve Gradient Descent
Based on Stochastic Gradient Descent (SGD), the paper introduces two optimizers, named Interpolational Accelerating Gradient Descent (IAGD) as well as Noise-Regularized Stochastic Gradient Descent (NRSGD). IAGD leverages second-order Newton Interpolation to expedite the convergence process during training, assuming relevancy in gradients between iterations. To avoid over-fitting, NRSGD incorporates a noise regularization technique that introduces controlled noise to the gradients during the optimization process. Comparative experiments of this research are conducted on the CIFAR-10, and CIFAR-100 datasets, benchmarking different CNNs(Convolutional Neural Networks) with IAGD and NRSGD against classical optimizers in Keras Package. Results demonstrate the potential of those two viable improvement methods in SGD, implicating the effectiveness of the advancements.
☆ Conformal Inference for Open-Set and Imbalanced Classification
This paper presents a conformal prediction method for classification in highly imbalanced and open-set settings, where there are many possible classes and not all may be represented in the data. Existing approaches require a finite, known label space and typically involve random sample splitting, which works well when there is a sufficient number of observations from each class. Consequently, they have two limitations: (i) they fail to provide adequate coverage when encountering new labels at test time, and (ii) they may become overly conservative when predicting previously seen labels. To obtain valid prediction sets in the presence of unseen labels, we compute and integrate into our predictions a new family of conformal p-values that can test whether a new data point belongs to a previously unseen class. We study these p-values theoretically, establishing their optimality, and uncover an intriguing connection with the classical Good--Turing estimator for the probability of observing a new species. To make more efficient use of imbalanced data, we also develop a selective sample splitting algorithm that partitions training and calibration data based on label frequency, leading to more informative predictions. Despite breaking exchangeability, this allows maintaining finite-sample guarantees through suitable re-weighting. With both simulated and real data, we demonstrate our method leads to prediction sets with valid coverage even in challenging open-set scenarios with infinite numbers of possible labels, and produces more informative predictions under extreme class imbalance.
☆ Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the humans' true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct for those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove in tabular domains PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high performing policies.
☆ Bridging Idealized and Operational Models: An Explainable AI Framework for Earth System Emulators
Computer models are indispensable tools for understanding the Earth system. While high-resolution operational models have achieved many successes, they exhibit persistent biases, particularly in simulating extreme events and statistical distributions. In contrast, coarse-grained idealized models isolate fundamental processes and can be precisely calibrated to excel in characterizing specific dynamical and statistical features. However, different models remain siloed by disciplinary boundaries. By leveraging the complementary strengths of models of varying complexity, we develop an explainable AI framework for Earth system emulators. It bridges the model hierarchy through a reconfigured latent data assimilation technique, uniquely suited to exploit the sparse output from the idealized models. The resulting bridging model inherits the high resolution and comprehensive variables of operational models while achieving global accuracy enhancements through targeted improvements from idealized models. Crucially, the mechanism of AI provides a clear rationale for these advancements, moving beyond black-box correction to physically insightful understanding in a computationally efficient framework that enables effective physics-assisted digital twins and uncertainty quantification. We demonstrate its power by significantly correcting biases in CMIP6 simulations of El Ni\~no spatiotemporal patterns, leveraging statistically accurate idealized models. This work also highlights the importance of pushing idealized model development and advancing communication between modeling communities.
☆ Information Shapes Koopman Representation
The Koopman operator provides a powerful framework for modeling dynamical systems and has attracted growing interest from the machine learning community. However, its infinite-dimensional nature makes identifying suitable finite-dimensional subspaces challenging, especially for deep architectures. We argue that these difficulties come from suboptimal representation learning, where latent variables fail to balance expressivity and simplicity. This tension is closely related to the information bottleneck (IB) dilemma: constructing compressed representations that are both compact and predictive. Rethinking Koopman learning through this lens, we demonstrate that latent mutual information promotes simplicity, yet an overemphasis on simplicity may cause latent space to collapse onto a few dominant modes. In contrast, expressiveness is sustained by the von Neumann entropy, which prevents such collapse and encourages mode diversity. This insight leads us to propose an information-theoretic Lagrangian formulation that explicitly balances this tradeoff. Furthermore, we propose a new algorithm based on the Lagrangian formulation that encourages both simplicity and expressiveness, leading to a stable and interpretable Koopman representation. Beyond quantitative evaluations, we further visualize the learned manifolds under our representations, observing empirical results consistent with our theoretical predictions. Finally, we validate our approach across a diverse range of dynamical systems, demonstrating improved performance over existing Koopman learning methods. The implementation is publicly available at https://github.com/Wenxuan52/InformationKoopman.
☆ Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment
Automated ultrasonic weld inspection remains a significant challenge in the nondestructive evaluation (NDE) community to factors such as limited training data (due to the complexity of curating experimental specimens or high-fidelity simulations) and environmental volatility of many industrial settings (resulting in the corruption of on-the-fly measurements). Thus, an end-to-end machine learning (ML) workflow for acoustic weld inspection in realistic (i.e., industrial) settings has remained an elusive goal. This work addresses the challenges of data curation and signal corruption by proposing workflow consisting of a reduced-order modeling scheme, diffusion based distribution alignment, and U-Net-based segmentation and inversion. A reduced-order Helmholtz model based on Lamb wave theory is used to generate a comprehensive dataset over varying weld heterogeneity and crack defects. The relatively inexpensive low-order solutions provide a robust training dateset for inversion models which are refined through a transfer learning stage using a limited set of full 3D elastodynamic simulations. To handle out-of-distribution (OOD) real-world measurements with varying and unpredictable noise distributions, i.e., Laser Doppler Vibrometry scans, guided diffusion produces in-distribution representations of OOD experimental LDV scans which are subsequently processed by the inversion models. This integrated framework provides an end-to-end solution for automated weld inspection on real data.
comment: 26 pages, 6 page appendix
☆ Escaping Local Optima in the Waddington Landscape: A Multi-Stage TRPO-PPO Approach for Single-Cell Perturbation Analysis
Modeling cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology. Existing data-driven framework have advanced perturbation prediction through variational autoencoders, chemically conditioned autoencoders, and large-scale transformer pretraining. However, these models are prone to local optima in the nonconvex Waddington landscape of cell fate decisions, where poor initialization can trap trajectories in spurious lineages or implausible differentiation outcomes. While executable gene regulatory networks complement these approaches, automated design frameworks incorporate biological priors through multi-agent optimization. Yet, an approach that is completely data-driven with well-designed initialization to escape local optima and converge to a proper lineage remains elusive. In this work, we introduce a multistage reinforcement learning algorithm tailored for single-cell perturbation modeling. We first compute an explicit natural gradient update using Fisher-vector products and a conjugate gradient solver, scaled by a KL trust-region constraint to provide a safe, curvature-aware the first step for the policy. Starting with these preconditioned parameters, we then apply a second phase of proximal policy optimization (PPO) with clipped surrogates, exploiting minibatch efficiency to refine the policy. We demonstrate that this initialization substantially improves generalization on Single-cell RNA sequencing (scRNA-seq) and Single-cell ATAC sequencing (scATAC-seq) pertubation analysis.
comment: 9 pages, 2 figures, 3 tables
☆ From Narratives to Probabilistic Reasoning: Predicting and Interpreting Drivers' Hazardous Actions in Crashes Using Large Language Model
Vehicle crashes involve complex interactions between road users, split-second decisions, and challenging environmental conditions. Among these, two-vehicle crashes are the most prevalent, accounting for approximately 70% of roadway crashes and posing a significant challenge to traffic safety. Identifying Driver Hazardous Action (DHA) is essential for understanding crash causation, yet the reliability of DHA data in large-scale databases is limited by inconsistent and labor-intensive manual coding practices. Here, we present an innovative framework that leverages a fine-tuned large language model to automatically infer DHAs from textual crash narratives, thereby improving the validity and interpretability of DHA classifications. Using five years of two-vehicle crash data from MTCF, we fine-tuned the Llama 3.2 1B model on detailed crash narratives and benchmarked its performance against conventional machine learning classifiers, including Random Forest, XGBoost, CatBoost, and a neural network. The fine-tuned LLM achieved an overall accuracy of 80%, surpassing all baseline models and demonstrating pronounced improvements in scenarios with imbalanced data. To increase interpretability, we developed a probabilistic reasoning approach, analyzing model output shifts across original test sets and three targeted counterfactual scenarios: variations in driver distraction and age. Our analysis revealed that introducing distraction for one driver substantially increased the likelihood of "General Unsafe Driving"; distraction for both drivers maximized the probability of "Both Drivers Took Hazardous Actions"; and assigning a teen driver markedly elevated the probability of "Speed and Stopping Violations." Our framework and analytical methods provide a robust and interpretable solution for large-scale automated DHA detection, offering new opportunities for traffic safety analysis and intervention.
☆ Max It or Miss It: Benchmarking LLM On Solving Extremal Problems
Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e. finding extrema under constraints, represents a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for Chinese Mathematical Olympiad and transformed into $93$ standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs' extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.
comment: Our benchmark dataset is available at https://huggingface.co/datasets/binxingao/extrem-bench
☆ CSI-4CAST: A Hybrid Deep Learning Model for CSI Prediction with Comprehensive Robustness and Generalization Testing
Channel state information (CSI) prediction is a promising strategy for ensuring reliable and efficient operation of massive multiple-input multiple-output (mMIMO) systems by providing timely downlink (DL) CSI. While deep learning-based methods have advanced beyond conventional model-driven and statistical approaches, they remain limited in robustness to practical non-Gaussian noise, generalization across diverse channel conditions, and computational efficiency. This paper introduces CSI-4CAST, a hybrid deep learning architecture that integrates 4 key components, i.e., Convolutional neural network residuals, Adaptive correction layers, ShuffleNet blocks, and Transformers, to efficiently capture both local and long-range dependencies in CSI prediction. To enable rigorous evaluation, this work further presents a comprehensive benchmark, CSI-RRG for Regular, Robustness and Generalization testing, which includes more than 300,000 samples across 3,060 realistic scenarios for both TDD and FDD systems. The dataset spans multiple channel models, a wide range of delay spreads and user velocities, and diverse noise types and intensity degrees. Experimental results show that CSI-4CAST achieves superior prediction accuracy with substantially lower computational cost, outperforming baselines in 88.9% of TDD scenarios and 43.8% of FDD scenario, the best performance among all evaluated models, while reducing FLOPs by 5x and 3x compared to LLM4CP, the strongest baseline. In addition, evaluation over CSI-RRG provides valuable insights into how different channel factors affect the performance and generalization capability of deep learning models. Both the dataset (https://huggingface.co/CSI-4CAST) and evaluation protocols (https://github.com/AI4OPT/CSI-4CAST) are publicly released to establish a standardized benchmark and to encourage further research on robust and efficient CSI prediction.
☆ Deep Learning-Based Visual Fatigue Detection Using Eye Gaze Patterns in VR
Prolonged exposure to virtual reality (VR) systems leads to visual fatigue, impairs user comfort, performance, and safety, particularly in high-stakes or long-duration applications. Existing fatigue detection approaches rely on subjective questionnaires or intrusive physiological signals, such as EEG, heart rate, or eye-blink count, which limit their scalability and real-time applicability. This paper introduces a deep learning-based study for detecting visual fatigue using continuous eye-gaze trajectories recorded in VR. We use the GazeBaseVR dataset comprising binocular eye-tracking data from 407 participants across five immersive tasks, extract cyclopean eye-gaze angles, and evaluate six deep classifiers. Our results demonstrate that EKYT achieves up to 94% accuracy, particularly in tasks demanding high visual attention, such as video viewing and text reading. We further analyze gaze variance and subjective fatigue measures, indicating significant behavioral differences between fatigued and non-fatigued conditions. These findings establish eye-gaze dynamics as a reliable and nonintrusive modality for continuous fatigue detection in immersive VR, offering practical implications for adaptive human-computer interactions.
comment: 8 pages, 3 figures, Accepted at IEEE International Symposium on Emerging Metaverse (ISEMV 2025)
☆ Behavioral Biometrics for Automatic Detection of User Familiarity in VR
As virtual reality (VR) devices become increasingly integrated into everyday settings, a growing number of users without prior experience will engage with VR systems. Automatically detecting a user's familiarity with VR as an interaction medium enables real-time, adaptive training and interface adjustments, minimizing user frustration and improving task performance. In this study, we explore the automatic detection of VR familiarity by analyzing hand movement patterns during a passcode-based door-opening task, which is a well-known interaction in collaborative virtual environments such as meeting rooms, offices, and healthcare spaces. While novice users may lack prior VR experience, they are likely to be familiar with analogous real-world tasks involving keypad entry. We conducted a pilot study with 26 participants, evenly split between experienced and inexperienced VR users, who performed tasks using both controller-based and hand-tracking interactions. Our approach uses state-of-the-art deep classifiers for automatic VR familiarity detection, achieving the highest accuracies of 92.05% and 83.42% for hand-tracking and controller-based interactions, respectively. In the cross-device evaluation, where classifiers trained on controller data were tested using hand-tracking data, the model achieved an accuracy of 78.89%. The integration of both modalities in the mixed-device evaluation obtained an accuracy of 94.19%. Our results underline the promise of using hand movement biometrics for the real-time detection of user familiarity in critical VR applications, paving the way for personalized and adaptive VR experiences.
comment: 7 pages, 7 figures, 17th International Conference on Quality of Multimedia Experience
☆ Simplicial Gaussian Models: Representation and Inference
Probabilistic graphical models (PGMs) are powerful tools for representing statistical dependencies through graphs in high-dimensional systems. However, they are limited to pairwise interactions. In this work, we propose the simplicial Gaussian model (SGM), which extends Gaussian PGM to simplicial complexes. SGM jointly models random variables supported on vertices, edges, and triangles, within a single parametrized Gaussian distribution. Our model builds upon discrete Hodge theory and incorporates uncertainty at every topological level through independent random components. Motivated by applications, we focus on the marginal edge-level distribution while treating node- and triangle-level variables as latent. We then develop a maximum-likelihood inference algorithm to recover the parameters of the full SGM and the induced conditional dependence structure. Numerical experiments on synthetic simplicial complexes with varying size and sparsity confirm the effectiveness of our algorithm.
☆ Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check
Current unlearning metrics for generative models evaluate success based on reference responses or classifier outputs rather than assessing the core objective: whether the unlearned model behaves indistinguishably from a model that never saw the unwanted data. This reference-specific approach creates systematic blind spots, allowing models to appear successful while retaining unwanted knowledge accessible through alternative prompts or attacks. We address these limitations by proposing Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models by comparing bidirectional likelihood assignments over generated samples. Unlike existing approaches that rely on predetermined references, FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. Our experiments on the TOFU benchmark for LLM unlearning and the UnlearnCanvas benchmark for text-to-image diffusion model unlearning reveal that methods achieving near-optimal scores on traditional metrics fail to achieve distributional equivalence, with many becoming more distant from the gold standard than before unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
comment: 20 pages, 11 figures
☆ A Connection Between Score Matching and Local Intrinsic Dimension NeurIPS 2025
The local intrinsic dimension (LID) of data is a fundamental quantity in signal processing and learning theory, but quantifying the LID of high-dimensional, complex data has been a historically challenging task. Recent works have discovered that diffusion models capture the LID of data through the spectra of their score estimates and through the rate of change of their density estimates under various noise perturbations. While these methods can accurately quantify LID, they require either many forward passes of the diffusion model or use of gradient computation, limiting their applicability in compute- and memory-constrained scenarios. We show that the LID is a lower bound on the denoising score matching loss, motivating use of the denoising score matching loss as a LID estimator. Moreover, we show that the equivalent implicit score matching loss also approximates LID via the normal dimension and is closely related to a recent LID estimator, FLIPD. Our experiments on a manifold benchmark and with Stable Diffusion 3.5 indicate that the denoising score matching loss is a highly competitive and scalable LID estimator, achieving superior accuracy and memory footprint under increasing problem size and quantization level.
comment: Accepted to the 3rd SPIGM Workshop at NeurIPS 2025
☆ Balancing Performance and Reject Inclusion: A Novel Confident Inlier Extrapolation Framework for Credit Scoring
Reject Inference (RI) methods aim to address sample bias by inferring missing repayment data for rejected credit applicants. Traditional approaches often assume that the behavior of rejected clients can be extrapolated from accepted clients, despite potential distributional differences between the two populations. To mitigate this blind extrapolation, we propose a novel Confident Inlier Extrapolation framework (CI-EX). CI-EX iteratively identifies the distribution of rejected client samples using an outlier detection model and assigns labels to rejected individuals closest to the distribution of the accepted population based on probabilities derived from a supervised classification model. The effectiveness of our proposed framework is validated through experiments on two large real-world credit datasets. Performance is evaluated using the Area Under the Curve (AUC) as well as RI-specific metrics such as Kickout and a novel metric introduced in this work, denoted as Area under the Kickout. Our findings reveal that RI methods, including the proposed framework, generally involve a trade-off between AUC and RI-specific metrics. However, the proposed CI-EX framework consistently outperforms existing RI models from the credit literature in terms of RI-specific metrics while maintaining competitive performance in AUC across most experiments.
comment: 45 pages, 19 figures
☆ Simulation-Based Pretraining and Domain Adaptation for Astronomical Time Series with Minimal Labeled Data
Astronomical time-series analysis faces a critical limitation: the scarcity of labeled observational data. We present a pre-training approach that leverages simulations, significantly reducing the need for labeled examples from real observations. Our models, trained on simulated data from multiple astronomical surveys (ZTF and LSST), learn generalizable representations that transfer effectively to downstream tasks. Using classifier-based architectures enhanced with contrastive and adversarial objectives, we create domain-agnostic models that demonstrate substantial performance improvements over baseline methods in classification, redshift estimation, and anomaly detection when fine-tuned with minimal real data. Remarkably, our models exhibit effective zero-shot transfer capabilities, achieving comparable performance on future telescope (LSST) simulations when trained solely on existing telescope (ZTF) data. Furthermore, they generalize to very different astronomical phenomena (namely variable stars from NASA's \textit{Kepler} telescope) despite being trained on transient events, demonstrating cross-domain capabilities. Our approach provides a practical solution for building general models when labeled data is scarce, but domain knowledge can be encoded in simulations.
☆ A Multimodal XAI Framework for Trustworthy CNNs and Bias Detection in Deep Representation Learning
Standard benchmark datasets, such as MNIST, often fail to expose latent biases and multimodal feature complexities, limiting the trustworthiness of deep neural networks in high-stakes applications. We propose a novel multimodal Explainable AI (XAI) framework that unifies attention-augmented feature fusion, Grad-CAM++-based local explanations, and a Reveal-to-Revise feedback loop for bias detection and mitigation. Evaluated on multimodal extensions of MNIST, our approach achieves 93.2% classification accuracy, 91.6% F1-score, and 78.1% explanation fidelity (IoU-XAI), outperforming unimodal and non-explainable baselines. Ablation studies demonstrate that integrating interpretability with bias-aware learning enhances robustness and human alignment. Our work bridges the gap between performance, transparency, and fairness, highlighting a practical pathway for trustworthy AI in sensitive domains.
☆ An Investigation of Memorization Risk in Healthcare Foundation Models
Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.
☆ HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection
Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adapt to different speakers by updating only a small subset of the layers. We propose HyWA-PVAD, a hypernetwork weight adaptation method, and evaluate it against multiple baseline conditioning techniques. Our comparison shows consistent improvements in PVAD performance. HyWA also offers practical advantages for deployment by preserving the core VAD architecture. Our new approach improves the current conditioning techniques in two ways: i) increases the mean average precision, ii) simplifies deployment by reusing the same VAD architecture.
comment: Mahsa Ghazvini Nejad and Hamed Jafarzadeh Asl contributed equally to this work
☆ Pruning Cannot Hurt Robustness: Certified Trade-offs in Reinforcement Learning
Reinforcement learning (RL) policies deployed in real-world environments must remain reliable under adversarial perturbations. At the same time, modern deep RL agents are heavily over-parameterized, raising costs and fragility concerns. While pruning has been shown to improve robustness in supervised learning, its role in adversarial RL remains poorly understood. We develop the first theoretical framework for certified robustness under pruning in state-adversarial Markov decision processes (SA-MDPs). For Gaussian and categorical policies with Lipschitz networks, we prove that element-wise pruning can only tighten certified robustness bounds; pruning never makes the policy less robust. Building on this, we derive a novel three-term regret decomposition that disentangles clean-task performance, pruning-induced performance loss, and robustness gains, exposing a fundamental performance--robustness frontier. Empirically, we evaluate magnitude and micro-pruning schedules on continuous-control benchmarks with strong policy-aware adversaries. Across tasks, pruning consistently uncovers reproducible ``sweet spots'' at moderate sparsity levels, where robustness improves substantially without harming - and sometimes even enhancing - clean performance. These results position pruning not merely as a compression tool but as a structural intervention for robust RL.
comment: 24 pages, 13 figures
☆ Learning at the Speed of Physics: Equilibrium Propagation on Oscillator Ising Machines NeurIPS 2025
Physical systems that naturally perform energy descent offer a direct route to accelerating machine learning. Oscillator Ising Machines (OIMs) exemplify this idea: their GHz-frequency dynamics mirror both the optimization of energy-based models (EBMs) and gradient descent on loss landscapes, while intrinsic noise corresponds to Langevin dynamics - supporting sampling as well as optimization. Equilibrium Propagation (EP) unifies these processes into descent on a single total energy landscape, enabling local learning rules without global backpropagation. We show that EP on OIMs achieves competitive accuracy ($\sim 97.2 \pm 0.1 \%$ on MNIST, $\sim 88.0 \pm 0.1 \%$ on Fashion-MNIST), while maintaining robustness under realistic hardware constraints such as parameter quantization and phase noise. These results establish OIMs as a fast, energy-efficient substrate for neuromorphic learning, and suggest that EBMs - often bottlenecked by conventional processors - may find practical realization on physical hardware whose dynamics directly perform their optimization.
comment: 4 pages, 2 figures, NeurIPS 2025 Machine Learning and the Physical Sciences (ML4PS)
☆ FedGTEA: Federated Class-Incremental Learning with Gaussian Task Embedding and Alignment
We introduce a novel framework for Federated Class Incremental Learning, called Federated Gaussian Task Embedding and Alignment (FedGTEA). FedGTEA is designed to capture task-specific knowledge and model uncertainty in a scalable and communication-efficient manner. At the client side, the Cardinality-Agnostic Task Encoder (CATE) produces Gaussian-distributed task embeddings that encode task knowledge, address statistical heterogeneity, and quantify data uncertainty. Importantly, CATE maintains a fixed parameter size regardless of the number of tasks, which ensures scalability across long task sequences. On the server side, FedGTEA utilizes the 2-Wasserstein distance to measure inter-task gaps between Gaussian embeddings. We formulate the Wasserstein loss to enforce inter-task separation. This probabilistic formulation not only enhances representation learning but also preserves task-level privacy by avoiding the direct transmission of latent embeddings, aligning with the privacy constraints in federated learning. Extensive empirical evaluations on popular datasets demonstrate that FedGTEA achieves superior classification performance and significantly mitigates forgetting, consistently outperforming strong existing baselines.
☆ Who's Asking? Evaluating LLM Robustness to Inquiry Personas in Factual Question Answering
Large Language Models (LLMs) should answer factual questions truthfully, grounded in objective knowledge, regardless of user context such as self-disclosed personal information, or system personalization. In this paper, we present the first systematic evaluation of LLM robustness to inquiry personas, i.e. user profiles that convey attributes like identity, expertise, or belief. While prior work has primarily focused on adversarial inputs or distractors for robustness testing, we evaluate plausible, human-centered inquiry persona cues that users disclose in real-world interactions. We find that such cues can meaningfully alter QA accuracy and trigger failure modes such as refusals, hallucinated limitations, and role confusion. These effects highlight how model sensitivity to user framing can compromise factual reliability, and position inquiry persona testing as an effective tool for robustness evaluation.
Efficient Inference for Coupled Hidden Markov Models in Continuous Time and Discrete Space
Systems of interacting continuous-time Markov chains are a powerful model class, but inference is typically intractable in high dimensional settings. Auxiliary information, such as noisy observations, is typically only available at discrete times, and incorporating it via a Doob's $h-$transform gives rise to an intractable posterior process that requires approximation. We introduce Latent Interacting Particle Systems, a model class parameterizing the generator of each Markov chain in the system. Our inference method involves estimating look-ahead functions (twist potentials) that anticipate future information, for which we introduce an efficient parameterization. We incorporate this approximation in a twisted Sequential Monte Carlo sampling scheme. We demonstrate the effectiveness of our approach on a challenging posterior inference task for a latent SIRS model on a graph, and on a neural model for wildfire spread dynamics trained on real data.
☆ Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
Critical thinking represents a fundamental competency in today's education landscape. Developing critical thinking skills through timely assessment and feedback is crucial; however, there has not been extensive work in the learning analytics community on defining, measuring, and supporting critical thinking. In this paper, we investigate the feasibility of measuring core "subskills" that underlie critical thinking. We ground our work in an authentic task where students operationalize critical thinking: student-written argumentative essays. We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays. We then evaluated three distinct approaches to automated scoring: zero-shot prompting, few-shot prompting, and supervised fine-tuning, implemented across three large language models (GPT-5, GPT-5-mini, and ModernBERT). GPT-5 with few-shot prompting achieved the strongest results and demonstrated particular strength on subskills with separable, frequent categories, while lower performance was observed for subskills that required detection of subtle distinctions or rare categories. Our results underscore critical trade-offs in automated critical thinking assessment: proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy with reduced sensitivity to minority categories. Our work represents an initial step toward scalable assessment of higher-order reasoning skills across authentic educational contexts.
comment: preprint: 17 pages
☆ SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.
comment: Project page: https://research.nvidia.com/labs/sil/projects/simuli
☆ From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models
Large Language Models (LLMs) are increasingly being deployed as the reasoning engines for agentic AI systems, yet they exhibit a critical flaw: a rigid adherence to explicit rules that leads to decisions misaligned with human common sense and intent. This "rule-rigidity" is a significant barrier to building trustworthy autonomous agents. While prior work has shown that supervised fine-tuning (SFT) with human explanations can mitigate this issue, SFT is computationally expensive and inaccessible to many practitioners. To address this gap, we introduce the Rule-Intent Distinction (RID) Framework, a novel, low-compute meta-prompting technique designed to elicit human-aligned exception handling in LLMs in a zero-shot manner. The RID framework provides the model with a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its final decision. We evaluated the RID framework against baseline and Chain-of-Thought (CoT) prompting on a custom benchmark of 20 scenarios requiring nuanced judgment across diverse domains. Our human-verified results demonstrate that the RID framework significantly improves performance, achieving a 95% Human Alignment Score (HAS), compared to 80% for the baseline and 75% for CoT. Furthermore, it consistently produces higher-quality, intent-driven reasoning. This work presents a practical, accessible, and effective method for steering LLMs from literal instruction-following to liberal, goal-oriented reasoning, paving the way for more reliable and pragmatic AI agents.
comment: 13 pages. Code and data are available at https://github.com/strongSoda/LITERAL-TO-LIBERAL
♻ ☆ Are Large Reasoning Models Interruptible?
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.
comment: We found a code/data bug in Section 5's intervene experiments. We're not sure how much it would affect the overall results and thus we're planning to fix it and do further investigation. As we do not want to leave any incorrect information on the internet, we want to withdraw this submission
♻ ☆ Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose \emph{Boundary-Guided Policy Optimization} (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks. Our codes and models are available at \href{https://github.com/THU-KEG/BGPO}{https://github.com/THU-KEG/BGPO}.
♻ ☆ Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$\%$ on common benchmarks.
♻ ☆ Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs
Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
♻ ☆ ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech(TTS) applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5 demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
♻ ☆ Optimally Deep Networks -- Adapting Model Depth to Datasets for Superior Efficiency
Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.
comment: 6 pages, 3 figures, 1 table
♻ ☆ Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits
We consider off-policy evaluation (OPE) in contextual bandits with finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE due to its unbiased, but it suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings do not minimize the mean squared error (MSE) of the estimators and do not consider context information. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator. Building on the theoretical analysis of bias and variance of MIPS, we present an MSE-minimizing objective for CAEL-MIPS. In the empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms baselines in terms of MSE.
♻ ☆ How to Train Your Metamorphic Deep Neural Network
Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the undelying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: https://github.com/TSommariva/HTTY_NeuMeta.
comment: Accepted with an Honorable Mention Award at ICIAP 2025 (Rome, Italy). 14 pages, 7 figures
♻ ☆ Multi-View Graph Learning with Graph-Tuple
Graph Neural Networks (GNNs) typically scale with the number of graph edges, making them well suited for sparse graphs but less efficient on dense graphs, such as point clouds or molecular interactions. A common remedy is to sparsify the graph via similarity thresholding or distance pruning, but this forces an arbitrary choice of a single interaction scale and discards crucial information from other scales. To overcome this limitation, we introduce a multi-view graph-tuple framework. Instead of a single graph, our graph-tuple framework partitions the graph into disjoint subgraphs, capturing primary local interactions and weaker, long-range connections. We then learn multi-view representations from the graph-tuple via a heterogeneous message-passing architecture inspired by the theory of non-commuting operators, which we formally prove is strictly more expressive and guarantees a lower oracle risk compared to single-graph message-passing models. We instantiate our framework on two scientific domains: molecular property prediction from feature-scarce Coulomb matrices and cosmological parameter inference from geometric point clouds. On both applications, our multi-view graph-tuple models demonstrate better performance than single-graph baselines, highlighting the power and versatility of our multi-view approach.
comment: Submitted to TAG workshop
♻ ☆ Reconstruction of SINR Maps from Sparse Measurements using Group Equivariant Non-Expansive Operators
As sixth generation (6G) wireless networks evolve, accurate signal-to-interference-noise ratio (SINR) maps are becoming increasingly critical for effective resource management and optimization. However, acquiring such maps at high resolution is often cost-prohibitive, creating a severe data scarcity challenge. This necessitates machine learning (ML) approaches capable of robustly reconstructing the full map from extremely sparse measurements. To address this, we introduce a novel reconstruction framework based on Group Equivariant Non-Expansive Operators (GENEOs). Unlike data-hungry ML models, GENEOs are low-complexity operators that embed domain-specific geometric priors, such as translation invariance, directly into their structure. This provides a strong inductive bias, enabling effective reconstruction from very few samples. Our key insight is that for network management, preserving the topological structure of the SINR map, such as the geometry of coverage holes and interference patterns, is often more critical than minimizing pixel-wise error. We validate our approach on realistic ray-tracing-based urban scenarios, evaluating performance with both traditional statistical metrics (mean squared error (MSE)) and, crucially, a topological metric (1-Wasserstein distance). Results show that while maintaining competitive MSE, our method dramatically outperforms established ML baselines in topological fidelity. This demonstrates the practical advantage of GENEOs for creating structurally accurate SINR maps that are more reliable for downstream network optimization tasks.
♻ ☆ Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning
Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.
comment: 33 pages, 9 figures
♻ ☆ Fixed Point Explainability
This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results for several datasets and models, including LLMs such as Llama-3.3-70B.
comment: The code is available here: https://anonymous.4open.science/r/fixed_point_explainability_iclr2026-D188
♻ ☆ Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features
Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates are typically established under the assumption of linearly independent features, which does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without making any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set instead of a single point.
♻ ☆ Investigating Faithfulness in Large Audio Language Models
Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
♻ ☆ Increase Alpha: Performance and Risk of an AI-Driven Trading Framework
There are inefficiencies in financial markets, with unexploited patterns in price, volume, and cross-sectional relationships. While many approaches use large-scale transformers, we take a domain-focused path: feed-forward and recurrent networks with curated features to capture subtle regularities in noisy financial data. This smaller-footprint design is computationally lean and reliable under low signal-to-noise, crucial for daily production at scale. At Increase Alpha, we built a deep-learning framework that maps over 800 U.S. equities into daily directional signals with minimal computational overhead. The purpose of this paper is twofold. First, we outline the general overview of the predictive model without disclosing its core underlying concepts. Second, we evaluate its real-time performance through transparent, industry standard metrics. Forecast accuracy is benchmarked against both naive baselines and macro indicators. The performance outcomes are summarized via cumulative returns, annualized Sharpe ratio, and maximum drawdown. The best portfolio combination using our signals provides a low-risk, continuous stream of returns with a Sharpe ratio of more than 2.5, maximum drawdown of around 3%, and a near-zero correlation with the S&P 500 market benchmark. We also compare the model's performance through different market regimes, such as the recent volatile movements of the US equity market in the beginning of 2025. Our analysis showcases the robustness of the model and significantly stable performance during these volatile periods. Collectively, these findings show that market inefficiencies can be systematically harvested with modest computational overhead if the right variables are considered. This report will emphasize the potential of traditional deep learning frameworks for generating an AI-driven edge in the financial market.
comment: To get access to the data, please contact s.ghatak@increasealpha.com The analysis now includes model performance through Q3 2025
♻ ☆ Quantizer Design for Finite Model Approximations, Model Learning, and Quantized Q-Learning for MDPs with Unbounded Spaces
In this paper, for Markov decision processes (MDPs) with unbounded state spaces we present refined upper bounds presented in [Kara et. al. JMLR'23] on finite model approximation errors via optimizing the quantizers used for finite model approximations. We also consider implications on quantizer design for quantized Q-learning and empirical model learning, and the performance of policies obtained via Q-learning where the quantized state is treated as the state itself. We highlight the distinctions between planning, where approximating MDPs can be independently designed, and learning (either via Q-learning or empirical model learning), where approximating MDPs are restricted to be defined by invariant measures of Markov chains under exploration policies, leading to significant subtleties on quantizer design performance, even though asymptotic near optimality can be established under both setups. In particular, under Lyapunov growth conditions, we obtain explicit upper bounds which decay to zero as the number of bins approaches infinity
♻ ☆ OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding NeurIPS 2025
Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the \textit{``one drafter for all''} paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
comment: Accepted to NeurIPS 2025
♻ ☆ One Prompt Fits All: Universal Graph Adaptation for Pretrained Models NeurIPS 2025
Graph Prompt Learning (GPL) has emerged as a promising paradigm that bridges graph pretraining models and downstream scenarios, mitigating label dependency and the misalignment between upstream pretraining and downstream tasks. Although existing GPL studies explore various prompt strategies, their effectiveness and underlying principles remain unclear. We identify two critical limitations: (1) Lack of consensus on underlying mechanisms: Despite current GPLs have advanced the field, there is no consensus on how prompts interact with pretrained models, as different strategies intervene at varying spaces within the model, i.e., input-level, layer-wise, and representation-level prompts. (2) Limited scenario adaptability: Most methods fail to generalize across diverse downstream scenarios, especially under data distribution shifts (e.g., homophilic-to-heterophilic graphs). To address these issues, we theoretically analyze existing GPL approaches and reveal that representation-level prompts essentially function as fine-tuning a simple downstream classifier, proposing that graph prompt learning should focus on unleashing the capability of pretrained models, and the classifier should adapt to downstream scenarios. Based on our findings, we propose UniPrompt, a novel GPL method that adapts any pretrained models, unleashing the capability of pretrained models while preserving the input graph. Extensive experiments demonstrate that our method can effectively integrate with various pretrained models and achieve strong performance across in-domain and cross-domain scenarios.
comment: accepted by NeurIPS 2025 main conference
♻ ☆ PPA-Game: Characterizing and Learning Competitive Dynamics Among Online Content Creators KDD 2025
In this paper, we present the Proportional Payoff Allocation Game (PPA-Game), which characterizes situations where agents compete for divisible resources. In the PPA-game, agents select from available resources, and their payoffs are proportionately determined based on heterogeneous weights attributed to them. Such dynamics simulate content creators on online recommender systems like YouTube and TikTok, who compete for finite consumer attention, with content exposure reliant on inherent and distinct quality. We first conduct a game-theoretical analysis of the PPA-Game. While the PPA-Game does not always guarantee the existence of a pure Nash equilibrium (PNE), we identify prevalent scenarios ensuring its existence. Simulated experiments further prove that the cases where PNE does not exist rarely happen. Beyond analyzing static payoffs, we further discuss the agents' online learning about resource payoffs by integrating a multi-player multi-armed bandit framework. We propose an online algorithm facilitating each agent's maximization of cumulative payoffs over $T$ rounds. Theoretically, we establish that the regret of any agent is bounded by $O(\log^{1 + \eta} T)$ for any $\eta > 0$. Empirical results further validate the effectiveness of our online learning approach.
comment: KDD 2025
♻ ☆ Denoised Diffusion for Object-Focused Image Augmentation
Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
♻ ☆ A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random
Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.
comment: Binh H. Ho, Long Nguyen Chi, and TrungTin Nguyen are co-first authors
♻ ☆ Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
The practice of fine-tuning AI agents on data from their own interactions--such as web browsing or tool use--, while being a strong general recipe for improving agentic capabilities, also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggerred by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.
comment: 27 pages
♻ ☆ Dimension Reduction with Locally Adjusted Graphs AAAI 2025
Dimension reduction (DR) algorithms have proven to be extremely useful for gaining insight into large-scale high-dimensional datasets, particularly finding clusters in transcriptomic data. The initial phase of these DR methods often involves converting the original high-dimensional data into a graph. In this graph, each edge represents the similarity or dissimilarity between pairs of data points. However, this graph is frequently suboptimal due to unreliable high-dimensional distances and the limited information extracted from the high-dimensional data. This problem is exacerbated as the dataset size increases. If we reduce the size of the dataset by selecting points for a specific sections of the embeddings, the clusters observed through DR are more separable since the extracted subgraphs are more reliable. In this paper, we introduce LocalMAP, a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address this challenge. By dynamically extracting subgraphs and updating the graph on-the-fly, LocalMAP is capable of identifying and separating real clusters within the data that other DR methods may overlook or combine. We demonstrate the benefits of LocalMAP through a case study on biological datasets, highlighting its utility in helping users more accurately identify clusters for real-world problems.
comment: Accepted by AAAI 2025
♻ ☆ Learning to sample fibers for goodness-of-fit testing
We consider the problem of constructing exact goodness-of-fit tests for discrete exponential family models. This classical problem remains practically unsolved for many types of structured or sparse data, as it rests on a computationally difficult core task: to produce a reliable sample from lattice points in a high-dimensional polytope. We translate the problem into a Markov decision process and demonstrate a reinforcement learning approach for learning `good moves' for sampling. We illustrate the approach on data sets and models for which traditional MCMC samplers converge too slowly due to problem size, sparsity structure, and the requirement to use prohibitive non-linear algebra computations in the process. The differentiating factor is the use of scalable tools from \emph{linear} algebra in the context of theoretical guarantees provided by \emph{non-linear} algebra. Our algorithm is based on an actor-critic sampling scheme, with provable convergence. The discovered moves can be used to efficiently obtain an exchangeable sample, significantly cutting computational times with regards to statistical testing.
comment: Most notable change is the link to code on github that implements the actor critic fiber sampler algorithm in python. Otherwise, minor revision adding some examples
♻ ☆ Go With the Flow: Fast Diffusion for Gaussian Mixture Models NIPS 2025
Schrodinger Bridges (SBs) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. The proposed method generalizes naturally to more general classes of dynamical systems, such as controllable linear time-varying systems, enabling efficient solutions to multi-marginal momentum SBs between GMMs, a challenging distribution interpolation problem. We showcase the potential of this approach in low-to-moderate dimensional problems such as image-to-image translation in the latent space of an autoencoder, learning of cellular dynamics using multi-marginal momentum SBs, and various other examples. The implementation is publicly available at https://github.com/georgeRapa/GMMflow.
comment: NIPS 2025 (spotlight)
♻ ☆ PEAR: Planner-Executor Agent Robustness Benchmark
Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.
♻ ☆ Can ChatGPT support software verification?
Large language models have become increasingly effective in software engineering tasks such as code generation, debugging and repair. Language models like ChatGPT can not only generate code, but also explain its inner workings and in particular its correctness. This raises the question whether we can utilize ChatGPT to support formal software verification. In this paper, we take some first steps towards answering this question. More specifically, we investigate whether ChatGPT can generate loop invariants. Loop invariant generation is a core task in software verification, and the generation of valid and useful invariants would likely help formal verifiers. To provide some first evidence on this hypothesis, we ask ChatGPT to annotate 106 C programs with loop invariants. We check validity and usefulness of the generated invariants by passing them to two verifiers, Frama-C and CPAchecker. Our evaluation shows that ChatGPT is able to produce valid and useful invariants allowing Frama-C to verify tasks that it could not solve before. Based on our initial insights, we propose ways of combining ChatGPT (or large language models in general) and software verifiers, and discuss current limitations and open issues.
comment: accepted at Fundamental Approaches to Software Engineering 2024
♻ ☆ A Generalized Information Bottleneck Theory of Deep Learning
The Information Bottleneck (IB) principle offers a compelling theoretical framework to understand how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a \textit{Generalized Information Bottleneck (GIB)} framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through joint processing of features. We provide theoretical and empirical evidence demonstrating that synergistic functions achieve superior generalization compared to their non-synergistic counterparts. Building on these foundations we re-formulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We demonstrate that the original IB objective is upper bounded by our GIB in the case of perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experimental results demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with \textit{ReLU} activations where the standard IB fails), while yielding interpretable dynamics in both CNNs and Transformers and aligning more closely with our understanding of adversarial robustness.
♻ ☆ Variational Rank Reduction Autoencoders
Deterministic Rank Reduction Autoencoders (RRAEs) enforce by construction a regularization on the latent space by applying a truncated SVD. While this regularization makes Autoencoders more powerful, using them for generative purposes is counter-intuitive due to their deterministic nature. On the other hand, Variational Autoencoders (VAEs) are well known for their generative abilities by learning a probabilistic latent space. In this paper, we present Variational Rank Reduction Autoencoders (VRRAEs), a model that leverages the advantages of both RRAEs and VAEs. Our claims and results show that when carefully sampling the latent space of RRAEs and further regularizing with the Kullback-Leibler (KL) divergence (similarly to VAEs), VRRAEs outperform RRAEs and VAEs. Additionally, we show that the regularization induced by the SVD not only makes VRRAEs better generators than VAEs, but also reduces the possibility of posterior collapse. Our results include a synthetic dataset of a small size that showcases the robustness of VRRAEs against collapse, and three real-world datasets; the MNIST, CelebA, and CIFAR-10, over which VRRAEs are shown to outperform both VAEs and RRAEs on many random generation and interpolation tasks based on the FID score. We developed an open-source implementation of VRRAEs in JAX (Equinox), available at https://github.com/JadM133/RRAEs.git.
♻ ☆ General Exploratory Bonus for Optimistic Exploration in RLHF
Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
♻ ☆ FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions NeurIPS 2025
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
comment: Project homepage: https://flageval-baai.github.io/LRM-Eval/ This work will also be presented at NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models (FoRLM)
♻ ☆ Central Limit Theorems for Asynchronous Averaged Q-Learning
This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We prove a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.
♻ ☆ Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP) -- memorized textual patterns from pre-training while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
♻ ☆ Inverse Design in Nanophotonics via Representation Learning
Inverse design in nanophotonics, the computational discovery of structures achieving targeted electromagnetic (EM) responses, has become a key tool for recent optical advances. Traditional intuition-driven or iterative optimization methods struggle with the inherently high-dimensional, non-convex design spaces and the substantial computational demands of EM simulations. Recently, machine learning (ML) has emerged to address these bottlenecks effectively. This review frames ML-enhanced inverse design methodologies through the lens of representation learning, classifying them into two categories: output-side and input-side approaches. Output-side methods use ML to learn a representation in the solution space to create a differentiable solver that accelerates optimization. Conversely, input-side techniques employ ML to learn compact, latent-space representations of feasible device geometries, enabling efficient global exploration through generative models. Each strategy presents unique trade-offs in data requirements, generalization capacity, and novel design discovery potentials. Hybrid frameworks that combine physics-based optimization with data-driven representations help escape poor local optima, improve scalability, and facilitate knowledge transfer. We conclude by highlighting open challenges and opportunities, emphasizing complexity management, geometry-independent representations, integration of fabrication constraints, and advancements in multiphysics co-designs.
♻ ☆ MooseAgent: A LLM Based Multi-agent Framework for Automating Moose Simulation
The Finite Element Method (FEM) is widely used in engineering and scientific computing, but its pre-processing, solver configuration, and post-processing stages are often time-consuming and require specialized knowledge. This paper proposes an automated solution framework, MooseAgent, for the multi-physics simulation framework MOOSE, which combines large-scale pre-trained language models (LLMs) with a multi-agent system. The framework uses LLMs to understand user-described simulation requirements in natural language and employs task decomposition and multi-round iterative verification strategies to automatically generate MOOSE input files. To improve accuracy and reduce model hallucinations, the system builds and utilizes a vector database containing annotated MOOSE input cards and function documentation. We conducted experimental evaluations on several typical cases, including heat transfer, mechanics, phase field, and multi-physics coupling. The results show that MooseAgent can automate the MOOSE simulation process to a certain extent, especially demonstrating a high success rate when dealing with relatively simple single-physics problems. The main contribution of this research is the proposal of a multi-agent automated framework for MOOSE, which validates its potential in simplifying finite element simulation processes and lowering the user barrier, providing new ideas for the development of intelligent finite element simulation software. The code for the MooseAgent framework proposed in this paper has been open-sourced and is available at https://github.com/taozhan18/MooseAgent
comment: 11 pages, 3 Figs
♻ ☆ Protein Design with Dynamic Protein Vocabulary NeurIPS 2025
Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.
comment: Accepted to NeurIPS 2025 (Spotlight)
♻ ☆ Optimistic Multi-Agent Policy Gradient ICML 2024
*Relative overgeneralization* (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods although these methods produce state-of-the-art results. To address this gap, we propose a general, yet simple, framework to enable optimistic updates in MAPG methods that alleviate the RO problem. Our approach involves clipping the advantage to eliminate negative values, thereby facilitating optimistic updates in MAPG. The optimism prevents individual agents from quickly converging to a local optimum. Additionally, we provide a formal analysis to show that the proposed method retains optimality at a fixed point. In extensive evaluations on a diverse set of tasks including the *Multi-agent MuJoCo* and *Overcooked* benchmarks, our method outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.
comment: Updated MMDP notation. Published at ICML 2024, 17 pages, 10 figures
♻ ☆ Query Brand Entity Linking in E-Commerce Search
In this work, we address the brand entity linking problem for e-commerce search queries. The entity linking task is done by either i)a two-stage process consisting of entity mention detection followed by entity disambiguation or ii) an end-to-end linking approaches that directly fetch the target entity given the input text. The task presents unique challenges: queries are extremely short (averaging 2.4 words), lack natural language structure, and must handle a massive space of unique brands. We present a two-stage approach combining named-entity recognition with matching, and a novel end-to-end solution using extreme multi-class classification. We validate our solutions by both offline benchmarks and the impact of online A/B test.
♻ ☆ Gen-C: Populating Virtual Worlds with Generative Crowds
Over the past two decades, researchers have made significant steps in simulating agent-based human crowds, yet most efforts remain focused on low-level tasks such as collision avoidance, path following, and flocking. Realistic simulations, however, require modeling high-level behaviors that emerge from agents interacting with each other and with their environment over time. We introduce Generative Crowds (Gen-C), a generative framework that produces crowd scenarios capturing agent-agent and agent-environment interactions, shaping coherent high-level crowd plans. To avoid the labor-intensive process of collecting and annotating real crowd video data, we leverage large language models (LLMs) to bootstrap synthetic datasets of crowd scenarios. We propose a time-expanded graph representation, encoding actions, interactions, and spatial context. Gen-C employs a dual Variational Graph Autoencoder (VGAE) architecture that jointly learns connectivity patterns and node features conditioned on textual and structural signals, overcoming the limitations of direct LLM generation to enable scalable, environment-aware multi-agent crowd simulations. We demonstrate the effectiveness of Gen-C on scenarios with diverse behaviors such as a University Campus and a Train Station, showing that it generates heterogeneous crowds, coherent interactions, and high-level decision-making patterns consistent with real-world crowd dynamics.
comment: 11 pages
♻ ☆ Resource-Constrained Federated Continual Learning: What Does Matter?
Federated Continual Learning (FCL) aims to enable sequentially privacy-preserving model training on streams of incoming data that vary in edge devices by preserving previous knowledge while adapting to new data. Current FCL literature focuses on restricted data privacy and access to previously seen data while imposing no constraints on the training overhead. This is unreasonable for FCL applications in real-world scenarios, where edge devices are primarily constrained by resources such as storage, computational budget, and label rate. We revisit this problem with a large-scale benchmark and analyze the performance of state-of-the-art FCL approaches under different resource-constrained settings. Various typical FCL techniques and six datasets in two incremental learning scenarios (Class-IL and Domain-IL) are involved in our experiments. Through extensive experiments amounting to a total of over 1,000+ GPU hours, we find that, under limited resource-constrained settings, existing FCL approaches, with no exception, fail to achieve the expected performance. Our conclusions are consistent in the sensitivity analysis. This suggests that most existing FCL methods are particularly too resource-dependent for real-world deployment. Moreover, we study the performance of typical FCL techniques with resource constraints and shed light on future research directions in FCL.
comment: arXiv admin note: text overlap with arXiv:2303.11165 by other authors
♻ ☆ Computing Systemic Risk Measures with Graph Neural Networks
This paper investigates systemic risk measures for stochastic financial networks of explicitly modelled bilateral liabilities. We extend the notion of systemic risk measures from Biagini, Fouque, Fritelli and Meyer-Brandis (2019) to graph structured data. In particular, we focus on an aggregation function that is derived from a market clearing algorithm proposed by Eisenberg and Noe (2001). In this setting, we show the existence of an optimal random allocation that distributes the overall minimal bailout capital and secures the network. We study numerical methods for the approximation of systemic risk and optimal random allocations. We propose to use permutation equivariant architectures of neural networks like graph neural networks (GNNs) and a class that we name (extended) permutation equivariant neural networks ((X)PENNs). We compare their performance to several benchmark allocations. The main feature of GNNs and (X)PENNs is that they are permutation equivariant with respect to the underlying graph data. In numerical experiments we find evidence that these permutation equivariant methods are superior to other approaches.
comment: 50 pages
♻ ☆ Dual Perspectives on Non-Contrastive Self-Supervised Learning
The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
♻ ☆ BAAF: A benchmark attention adaptive framework for medical ultrasound image segmentation tasks
The AI-based assisted diagnosis programs have been widely investigated on medical ultrasound images. Complex scenario of ultrasound image, in which the coupled interference of internal and external factors is severe, brings a unique challenge for localize the object region automatically and precisely in ultrasound images. In this study, we seek to propose a more general and robust Benchmark Attention Adaptive Framework (BAAF) to assist doctors segment or diagnose lesions and tissues in ultrasound images more quickly and accurately. Different from existing attention schemes, the BAAF consists of a parallel hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM). Specifically, BAAF first coarsely calibrates the input features from the channel and spatial dimensions, and then adaptively selects more robust lesion or tissue characterizations from the coarse-calibrated feature maps. The design of BAAF further optimizes the "what" and "where" focus and selection problems in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in medical ultrasound images. The method is evaluated on four medical ultrasound segmentation tasks, and the adequate experimental results demonstrate the remarkable performance improvement over existing state-of-the-art methods. In addition, the comparison with existing attention mechanisms also demonstrates the superiority of BAAF. This work provides the possibility for automated medical ultrasound assisted diagnosis and reduces reliance on human accuracy and precision.
comment: 10 pages, 11 figures
♻ ☆ ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose \textbf{AC}cumulate while \textbf{CO}mmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
♻ ☆ Nash Equilibria in Games with Playerwise Concave Coupling Constraints: Existence and Computation
We study the existence and computation of Nash equilibria in continuous static games where the players' admissible strategies are subject to shared coupling constraints, i.e., constraints that depend on their \emph{joint} strategies. Specifically, we focus on a class of games characterized by playerwise concave utilities and playerwise concave constraints. Prior results on the existence of Nash equilibria are not applicable to this class, as they rely on strong assumptions such as joint convexity of the feasible set. By leveraging topological fixed point theory and novel structural insights into the contractibility of feasible sets under playerwise concave constraints, we give an existence proof for Nash equilibria under weaker conditions. Having established existence, we then focus on the computation of Nash equilibria via independent gradient methods under the additional assumption that the utilities admit a potential function. To account for the possibly nonconvex feasible region, we employ a log barrier regularized gradient ascent with adaptive stepsizes. Starting from an initial feasible strategy profile and under exact gradient feedback, the proposed method converges to an $\epsilon$-approximate constrained Nash equilibrium within $\mathcal{O}(\epsilon^{-3})$ iterations.
♻ ☆ Wasserstein-based Kernel Principal Component Analysis for Clustering Applications
Many data clustering applications must handle objects that cannot be represented as vectors. In this context, the bag-of-vectors representation describes complex objects through discrete distributions, for which the Wasserstein distance provides a well-conditioned dissimilarity measure. Kernel methods extend this by embedding distance information into feature spaces that facilitate analysis. However, an unsupervised framework that combines kernels with Wasserstein distances for clustering distributional data is still lacking. We address this gap by introducing a computationally tractable framework that integrates Wasserstein metrics with kernel methods for clustering. The framework can accommodate both vectorial and distributional data, enabling applications in various domains. It comprises three components: (i) an efficient approximation of pairwise Wasserstein distances using multiple reference distributions; (ii) shifted positive definite kernel functions based on Wasserstein distances, combined with kernel principal component analysis for feature mapping; and (iii) scalable, distance-agnostic validity indices for clustering evaluation and kernel parameter optimization. Experiments on power distribution graphs and real-world time series demonstrate the effectiveness and efficiency of the proposed framework.
♻ ☆ Offline Fictitious Self-Play for Competitive Games
Offline Reinforcement Learning (RL) enables policy improvement from fixed datasets without online interactions, making it highly suitable for real-world applications lacking efficient simulators. Despite its success in the single-agent setting, offline multi-agent RL remains a challenge, especially in competitive games. Firstly, unaware of the game structure, it is impossible to interact with the opponents and conduct a major learning paradigm, self-play, for competitive games. Secondly, real-world datasets cannot cover all the state and action space in the game, resulting in barriers to identifying Nash equilibrium (NE). To address these issues, this paper introduces OFF-FSP, the first practical model-free offline RL algorithm for competitive games. We start by simulating interactions with various opponents by adjusting the weights of the fixed dataset with importance sampling. This technique allows us to learn the best responses to different opponents and employ the Offline Self-Play learning framework. To overcome the challenge of partial coverage, we combine the single-agent offline RL method with Fictitious Self-Play (FSP) to approximate NE by constraining the approximate best responses away from out-of-distribution actions. Experiments on matrix games, extensive-form poker, and board games demonstrate that OFF-FSP achieves significantly lower exploitability than state-of-the-art baselines. Finally, we validate OFF-FSP on a real-world human-robot competitive task, demonstrating its potential for solving complex, hard-to-simulate real-world problems.
♻ ☆ SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning NeurIPS 2025
Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning. However, developing PRMs is challenging due to the high cost and limited scalability of human-annotated data. Synthetic data from Monte Carlo (MC) estimation is a promising alternative but suffers from a high noise ratio, which can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study on the noise distribution in synthetic data from MC estimation, identifying that annotation models tend to both underestimate and overestimate step correctness due to limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings indicate that: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can effectively learn from this weak supervision, achieving a 39.2 F1 score improvement (from 19.9 to 59.1) in ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.
comment: NeurIPS 2025. Project page: https://scan-prm.github.io/
♻ ☆ Split Conformal Classification with Unsupervised Calibration
Methods for split conformal prediction leverage calibration samples to transform any prediction rule into a set-prediction rule that complies with a target coverage probability. Existing methods provide remarkably strong performance guarantees with minimal computational costs. However, they require to use calibration samples composed by labeled examples different to those used for training. This requirement can be highly inconvenient, as it prevents the use of all labeled examples for training and may require acquiring additional labels solely for calibration. This paper presents an effective methodology for split conformal prediction with unsupervised calibration for classification tasks. In the proposed approach, set-prediction rules are obtained using unsupervised calibration samples together with supervised training samples previously used to learn the classification rule. Theoretical and experimental results show that the presented methods can achieve performance comparable to that with supervised calibration, at the expenses of a moderate degradation in performance guarantees and computational efficiency.
♻ ☆ SpaPool: Soft Partition Assignment Pooling for__Graph Neural Networks
This paper introduces SpaPool, a novel pooling method that combines the strengths of both dense and sparse techniques for a graph neural network. SpaPool groups vertices into an adaptive number of clusters, leveraging the benefits of both dense and sparse approaches. It aims to maintain the structural integrity of the graph while reducing its size efficiently. Experimental results on several datasets demonstrate that SpaPool achieves competitive performance compared to existing pooling techniques and excels particularly on small-scale graphs. This makes SpaPool a promising method for applications requiring efficient and effective graph processing.
♻ ☆ Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.
♻ ☆ MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we study lightweight extensions to BERT that refine the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach, called MaxPoolBERT, enhances BERT's classification accuracy (especially on low-resource tasks) without requiring new pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently achieves a better performance than the standard BERT base model on low resource tasks of the GLUE benchmark.
♻ ☆ Riemannian generative decoder ICML 2025
Riemannian representation learning typically relies on an encoder to estimate densities on chosen manifolds. This involves optimizing numerically brittle objectives, potentially harming model training and quality. To completely circumvent this issue, we introduce the Riemannian generative decoder, a unifying approach for finding manifold-valued latents on any Riemannian manifold. Latents are learned with a Riemannian optimizer while jointly training a decoder network. By discarding the encoder, we vastly simplify the manifold constraint compared to current approaches which often only handle few specific manifolds. We validate our approach on three case studies -- a synthetic branching diffusion process, human migrations inferred from mitochondrial DNA, and cells undergoing a cell division cycle -- each showing that learned representations respect the prescribed geometry and capture intrinsic non-Euclidean structure. Our method requires only a decoder, is compatible with existing architectures, and yields interpretable latent spaces aligned with data geometry. Code available on https://github.com/yhsure/riemannian-generative-decoder.
comment: GenBio ICML 2025 (Proceedings of the Workshop on Generative AI for Biology at the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025)
♻ ☆ Causal Ordering Without Effect Estimation: A Framework for Using Proxies in Treatment Prioritization
Who should we prioritize for treatment when causal effects cannot be estimated? In practice, organizations often rely on predictive proxies: ads are targeted using purchase probabilities, and retention incentives are allocated using churn-risk scores. These models are not causal, but they are often used with the aim of ranking individuals by treatment effects, a task we call causal ordering. We develop a decision-focused framework to reason about this practice. We identify conditions under which proxies recover the correct effect ordering, which hold when a proxy reflects a dominant moderator of treatment effects. We show how these conditions emerge as a useful approximation in discrete choice settings, where the propensity to act without an intervention moderates persuasion. Moreover, we extend beyond this case, demonstrating that proxies capturing a non-dominant moderator can still outperform CATE estimates when they target signals that are easier to estimate precisely. Building on these insights, we introduce diagnostic tools to assess proxy usefulness in practice. Finally, we illustrate the framework in advertising, where a simple predictive proxy outperforms heterogeneous-effect estimation methods.
♻ ☆ BrainForm: a Serious Game for BCI Training and Data Collection
$\textit{BrainForm}$ is a gamified Brain-Computer Interface (BCI) training system designed for scalable data collection using consumer hardware and a minimal setup. We investigated (1) how users develop BCI control skills across repeated sessions and (2) perceptual and performance effects of two visual stimulation textures. Game Experience Questionnaire (GEQ) scores for Flow, Positive Affect, Competence and Challenge were strongly positive, indicating sustained engagement. A within-subject study with multiple runs, two task complexities, and post-session questionnaires revealed no significant performance differences between textures but increased ocular irritation over time. Online metrics$\unicode{x2013}$Task Accuracy, Task Time, and Information Transfer Rate$\unicode{x2013}$improved across sessions, confirming learning effects for symbol spelling, even under pressure conditions. Our results highlight the potential of $\textit{BrainForm}$ as a scalable, user-friendly BCI research tool and offer guidance for sustained engagement and reduced training fatigue.
comment: 15 pages, 6 figures. Author-accepted version. Accepted for presentation at the Brain Informatics 2025 conference, to appear in Springer Lecture Notes in Artificial Intelligence (LNAI) Brain Informatics Books Series. The final authenticated version will be available via SpringerLink
♻ ☆ NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
comment: https://github.com/dunnolab/NinA/
♻ ☆ Interpretable Reward Model via Sparse Autoencoder
Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
comment: Commercial firm need to review this paper before publishing it
♻ ☆ Constraint Matters: Multi-Modal Representation for Reducing Mixed-Integer Linear programming
Model reduction, which aims to learn a simpler model of the original mixed integer linear programming (MILP), can solve large-scale MILP problems much faster. Most existing model reduction methods are based on variable reduction, which predicts a solution value for a subset of variables. From a dual perspective, constraint reduction that transforms a subset of inequality constraints into equalities can also reduce the complexity of MILP, but has been largely ignored. Therefore, this paper proposes a novel constraint-based model reduction approach for the MILP. Constraint-based MILP reduction has two challenges: 1) which inequality constraints are critical such that reducing them can accelerate MILP solving while preserving feasibility, and 2) how to predict these critical constraints efficiently. To identify critical constraints, we first label these tight-constraints at the optimal solution as potential critical constraints and design a heuristic rule to select a subset of critical tight-constraints. To learn the critical tight-constraints, we propose a multi-modal representation technique that leverages information from both instance-level and abstract-level MILP formulations. The experimental results show that, compared to the state-of-the-art methods, our method improves the quality of the solution by over 50\% and reduces the computation time by 17.47\%.
comment: Since the article needs improvement, it will be temporarily withdrawn
♻ ☆ Persistent Homology via Ellipsoids
Persistent homology is one of the most popular methods in topological data analysis. An initial step in its use involves constructing a nested sequence of simplicial complexes. There is an abundance of different complexes to choose from, with \v{C}ech, Rips, alpha, and witness complexes being popular choices. In this manuscript, we build a novel type of geometrically informed simplicial complex, called a Rips-type ellipsoid complex. This complex is based on the idea that ellipsoids aligned with tangent directions better approximate the data compared to conventional (Euclidean) balls centered at sample points, as used in the construction of Rips and Alpha complexes. We use Principal Component Analysis to estimate tangent spaces directly from samples and present an algorithm for computing Rips-type ellipsoid barcodes, i.e., topological descriptors based on Rips-type ellipsoid complexes. Additionally, we show that the ellipsoid barcodes depend continuously on the input data so that small perturbations of a k-generic point cloud lead to proportionally small changes in the resulting ellipsoid barcodes. This provides a theoretical guarantee analogous, if somewhat weaker, to the classical stability results for Rips and \v{C}ech filtrations. We also conduct extensive experiments and compare Rips-type ellipsoid barcodes with standard Rips barcodes. Our findings indicate that Rips-type ellipsoid complexes are particularly effective for estimating the homology of manifolds and spaces with bottlenecks from samples. In particular, the persistence intervals corresponding to ground-truth topological features are longer compared to those obtained using the Rips complex of the data. Furthermore, Rips-type ellipsoid barcodes lead to better classification results in sparsely sampled point clouds. Finally, we demonstrate that Rips-type ellipsoid barcodes outperform Rips barcodes in classification tasks.
LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
Large Language Models (LLMs) exhibit enhanced capabilities by Chain-of-Thought reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a \textbf{Token Importance Recurrence} phenomenon: a large proportion of tokens regain high attention after multiple decoding steps, which is failed to capture by existing works and may lead to unpredictable eviction on such periodically critical tokens. To address this, we propose \textbf{LazyEviction}, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache by 50\%\textasciitilde70\% while maintaining comparable accuracy, outperforming existing KV cache compression baselines. Our implementation code can be found at https://github.com/Halo-949/LazyEviction.
♻ ☆ Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization
Numerous heuristics and advanced approaches have been proposed for exploration in different settings for deep reinforcement learning. Noise-based exploration generally fares well with dense-shaped rewards and bonus-based exploration with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this endeavor, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft-Actor Critic agent with the addition of SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.
♻ ☆ BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals NeurIPS 2025
Electroencephalography (EEG) and magnetoencephalography (MEG) measure neural activity non-invasively by capturing electromagnetic fields generated by dendritic currents. Although rooted in the same biophysics, EEG and MEG exhibit distinct signal patterns, further complicated by variations in sensor configurations across modalities and recording devices. Existing approaches typically rely on separate, modality- and dataset-specific models, which limits the performance and cross-domain scalability. This paper proposes BrainOmni, the first brain foundation model that generalises across heterogeneous EEG and MEG recordings. To unify diverse data sources, we introduce BrainTokenizer,the first tokenizer that quantises spatiotemporal brain activity into discrete representations. Central to BrainTokenizer is a novel Sensor Encoder that encodes sensor properties such as spatial layout, orientation, and type, enabling compatibility across devices and modalities. Building upon the discrete representations, BrainOmni learns unified semantic embeddings of brain signals by self-supervised pretraining. To the best of our knowledge, it is the first foundation model to support both EEG and MEG signals, as well as the first to incorporate large-scale MEG pretraining. A total of 1,997 hours of EEG and 656 hours of MEG data are curated and standardised from publicly available sources for pretraining. Experiments show that BrainOmni outperforms both existing foundation models and state-of-the-art task-specific models on a range of downstream tasks. It also demonstrates strong generalisation to unseen EEG and MEG devices. Further analysis reveals that joint EEG-MEG (EMEG) training yields consistent improvements across both modalities. Code and model checkpoints will be released upon acceptance.
comment: Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ RIGNO: A Graph-based framework for robust and accurate operator learning for PDEs on arbitrary domains
Learning the solution operators of PDEs on arbitrary domains is challenging due to the diversity of possible domain shapes, in addition to the often intricate underlying physics. We propose an end-to-end graph neural network (GNN) based neural operator to learn PDE solution operators from data on point clouds in arbitrary domains. Our multi-scale model maps data between input/output point clouds by passing it through a downsampled regional mesh. The approach includes novel elements aimed at ensuring spatio-temporal resolution invariance. Our model, termed RIGNO, is tested on a challenging suite of benchmarks composed of various time-dependent and steady PDEs defined on a diverse set of domains. We demonstrate that RIGNO is significantly more accurate than neural operator baselines and robustly generalizes to unseen resolutions both in space and in time. Our code is publicly available at github.com/camlab-ethz/rigno.
♻ ☆ General Demographic Foundation Models for Enhancing Predictive Performance Across Diseases and Populations
Demographic attributes are universally present in electronic health records. They are the most widespread information across populations and diseases, and serve as vital predictors in clinical risk stratification and treatment decisions. Despite their significance, these attributes are often treated as auxiliaries in model design, with limited attention being paid to learning their representations. This study explored the development of a General Demographic Pre-trained (GDP) model as a foundational model tailored to demographic attributes, focusing on age and gender. The model is pre-trained and evaluated using datasets with diverse diseases and populations compositions from different geographic regions. The composition of GDP architecture was explored through examining combinations of ordering approaches and encoding methods to transform tabular demographic inputs into effective latent embeddings. Results demonstrate the feasibility of GDP to generalize across task, diseases, and populations. In detailed composition, the sequential ordering substantially improves model performance in discrimination, calibration, and the corresponding information gain at each decision tree split, particularly in diseases where age and gender contribute significantly to risk stratification. Even in datasets where demographic attributes hold relatively low predictive value, GDP enhances the representational importance, increasing their influence in downstream gradient boosting models. The findings suggest that foundation models for tabular demographic attributes offer a promising direction for improving predictive performance in healthcare applications.
♻ ☆ Steering Large Language Models for Machine Translation Personalization
Large language models have simplified the production of personalized translations reflecting predefined stylistic constraints. However, these systems still struggle when stylistic requirements are implicitly represented by a set of examples, such as texts produced by a specific human translator. In this work, we explore various strategies for personalizing automatically generated translations when few examples are available, with a focus on the challenging domain of literary translation. We begin by determining the feasibility of the task and how style information is encoded within model representations. Then, we evaluate various prompting strategies and inference-time interventions for steering model generations towards a personalized style, with a particular focus on contrastive steering with sparse autoencoder (SAE) latents to identify salient personalization properties. We demonstrate that contrastive SAE steering yields robust style conditioning and translation quality, resulting in higher inference-time computational efficiency than prompting approaches. We further examine the impact of steering on model activations, finding that layers encoding personalization properties are impacted similarly by prompting and SAE steering, suggesting a similar mechanism at play.
♻ ☆ Competitive Advantage Attacks to Decentralized Federated Learning NeurIPS
Decentralized federated learning (DFL) enables clients (e.g., hospitals and banks) to jointly train machine learning models without a central orchestration server. In each global training round, each client trains a local model on its own training data and then they exchange local models for aggregation. In this work, we propose SelfishAttack, a new family of attacks to DFL. In SelfishAttack, a set of selfish clients aim to achieve competitive advantages over the remaining non-selfish ones, i.e., the final learnt local models of the selfish clients are more accurate than those of the non-selfish ones. Towards this goal, the selfish clients send carefully crafted local models to each remaining non-selfish one in each global training round. We formulate finding such local models as an optimization problem and propose methods to solve it when DFL uses different aggregation rules. Theoretically, we show that our methods find the optimal solutions to the optimization problem. Empirically, we show that SelfishAttack successfully increases the accuracy gap (i.e., competitive advantage) between the final learnt local models of selfish clients and those of non-selfish ones. Moreover, SelfishAttack achieves larger accuracy gaps than poisoning attacks when extended to increase competitive advantages.
comment: Conference on Neural Information Processing Systems (NeurIPS) 2025
♻ ☆ Padé Approximant Neural Networks for Enhanced Electric Motor Fault Diagnosis Using Vibration and Acoustic Data
Purpose: The primary aim of this study is to enhance fault diagnosis in induction machines by leveraging the Pad\'e Approximant Neuron (PAON) model. While accelerometers and microphones are standard in motor condition monitoring, deep learning models with nonlinear neuron architectures offer promising improvements in diagnostic performance. This research investigates whether Pad\'e Approximant Neural Networks (Pad\'eNets) can outperform conventional Convolutional Neural Networks (CNNs) and Self-Organized Operational Neural Networks (Self-ONNs) in the diagnosis of electrical and mechanical faults from vibration and acoustic data. Methods: We evaluate and compare the diagnostic capabilities of three deep learning architectures: one-dimensional CNNs, Self-ONNs, and Pad\'eNets. These models are tested on the University of Ottawa's publicly available constant-speed induction motor datasets, which include both vibration and acoustic sensor data. The Pad\'eNet model is designed to introduce enhanced nonlinearity and is compatible with unbounded activation functions such as LeakyReLU. Results and Conclusion: Pad\'eNets consistently outperformed the baseline models, achieving diagnostic accuracies of 99.96%, 98.26%, 97.61%, and 98.33% for accelerometers 1, 2, 3, and the acoustic sensor, respectively. The enhanced nonlinearity of Pad\'eNets, together with their compatibility with unbounded activation functions, significantly improves fault diagnosis performance in induction motor condition monitoring.
comment: This version is the author's accepted manuscript. It has been peer-reviewed and accepted for publication in Journal of Vibration Engineering & Technologies. The final published version is available at https://doi.org/10.1007/s42417-025-02129-5
♻ ☆ GTCN-G: A Residual Graph-Temporal Fusion Network for Imbalanced Intrusion Detection (Preprint)
The escalating complexity of network threats and the inherent class imbalance in traffic data present formidable challenges for modern Intrusion Detection Systems (IDS). While Graph Neural Networks (GNNs) excel in modeling topological structures and Temporal Convolutional Networks (TCNs) are proficient in capturing time-series dependencies, a framework that synergistically integrates both while explicitly addressing data imbalance remains an open challenge. This paper introduces a novel deep learning framework, named Gated Temporal Convolutional Network and Graph (GTCN-G), engineered to overcome these limitations. Our model uniquely fuses a Gated TCN (G-TCN) for extracting hierarchical temporal features from network flows with a Graph Convolutional Network (GCN) designed to learn from the underlying graph structure. The core innovation lies in the integration of a residual learning mechanism, implemented via a Graph Attention Network (GAT). This mechanism preserves original feature information through residual connections, which is critical for mitigating the class imbalance problem and enhancing detection sensitivity for rare malicious activities (minority classes). We conducted extensive experiments on two public benchmark datasets, UNSW-NB15 and ToN-IoT, to validate our approach. The empirical results demonstrate that the proposed GTCN-G model achieves state-of-the-art performance, significantly outperforming existing baseline models in both binary and multi-class classification tasks.
comment: This preprint was submitted to IEEE TrustCom 2025. The accepted version will be published under copyright 2025 IEEE
♻ ☆ EMSEdit: Efficient Multi-Step Meta-Learning-based Model Editing
Large Language Models (LLMs) power numerous AI applications, yet updating their knowledge remains costly. Model editing provides a lightweight alternative through targeted parameter modifications, with meta-learning-based model editing (MLME) demonstrating strong effectiveness and efficiency. However, we find that MLME struggles in low-data regimes and incurs high training costs due to the use of KL divergence. To address these issues, we propose $\textbf{E}$fficient $\textbf{M}$ulti-$\textbf{S}$tep $\textbf{Edit (EMSEdit)}$, which leverages multi-step backpropagation (MSBP) to effectively capture gradient-activation mapping patterns within editing samples, performs multi-step edits per sample to enhance editing performance under limited data, and introduces norm-based regularization to preserve unedited knowledge while improving training efficiency. Experiments on two datasets and three LLMs show that EMSEdit consistently outperforms state-of-the-art methods in both sequential and batch editing. Moreover, MSBP can be seamlessly integrated into existing approaches to yield additional performance gains. Further experiments on a multi-hop reasoning editing task demonstrate EMSEdit's robustness in handling complex edits, while ablation studies validate the contribution of each design component. Our code is available at https://github.com/xpq-tech/emsedit.
♻ ☆ Cross-Modal Safety Alignment: Is textual unlearning all you need?
Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability -- textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8\% and in some cases, even as low as nearly 2\% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.
comment: Accepted by EMNLP 2024 Findings
♻ ☆ Mirror Descent Actor Critic via Bounded Advantage Learning
Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of KL-entroy-regularized methods do not surpass a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor's log-density terms in the critic's loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor's log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advantage terms is validated and beneficial. We also empirically explore effective choices for the bounding functions, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding functions.
♻ ☆ Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples
Co-occurrence network inference algorithms have significantly advanced our understanding of microbiome communities. However, these algorithms typically analyze microbial associations within samples collected from a single environmental niche, often capturing only static snapshots rather than dynamic microbial processes. Previous studies have commonly grouped samples from different environmental niches together without fully considering how microbial communities adapt their associations when faced with varying ecological conditions. Our study addresses this limitation by explicitly investigating both spatial and temporal dynamics of microbial communities. We analyzed publicly available microbiome abundance data across multiple locations and time points, to evaluate algorithm performance in predicting microbial associations using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates algorithms in two distinct scenarios: training and testing within the same environmental niche (Same), and training and testing on combined data from multiple environmental niches (All). To overcome the limitations of conventional algorithms, we propose fuser, an algorithm that, while not entirely new in machine learning, is novel for microbiome community network inference. It retains subsample-specific signals while simultaneously sharing relevant information across environments during training. Unlike standard approaches that infer a single generalized network from combined data, fuser generates distinct, environment-specific predictive networks. Our results demonstrate that fuser achieves comparable predictive performance to existing algorithms such as glmnet when evaluated within homogeneous environments (Same), and notably reduces test error compared to baseline algorithms in cross-environment (All) scenarios.
♻ ☆ Finding Dori: Memorization in Text-to-Image Diffusion Models Is Not Local
Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such defenses. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the nature of memorization in text-to-image DMs and inform the development of more reliable mitigations against DM memorization.
♻ ☆ ICL-Router: In-Context Learned Model Representations for LLM Routing
Large language models (LLMs) often exhibit complementary strengths. Model routing harnesses these strengths by dynamically directing each query to the most suitable model, given a candidate model pool. However, routing performance relies on accurate model representations, and adding new models typically requires retraining, limiting scalability. To address these challenges, we propose a novel routing method using in-context vectors to represent model capabilities. The method proceeds in two stages. First, queries are embedded and projected into vectors, with a projector and LLM-based router trained to reconstruct the original queries, aligning vector representations with the router's semantic space. Second, each candidate model is profiled on a query set, and the router learns -- based on in-context vectors of query and model performance -- to predict whether each model can correctly answer new queries. Extensive experiments demonstrate that our method achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks. Moreover, our method allows for seamless integration of new models without retraining the router. The code is available at https://github.com/lalalamdbf/ICL-Router.
comment: preprint
♻ ☆ EgoBrain: Synergizing Minds and Eyes For Human Action Understanding
The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models have brought new possibilities that have never been imagined before. Here, we present EgoBrain --the world's first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a muiltimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interface with multiple modalities. All data, tools, and acquisition protocols are openly shared to foster open science in cognitive computing.
comment: 22 pages, 12 figures
♻ ☆ Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series NeurIPS 2025
Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://github.com/blacksnail789521/IMM-TSF.
comment: This paper has been accepted by the NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ An Introduction to Sliced Optimal Transport
Sliced Optimal Transport (SOT) is a rapidly developing branch of optimal transport (OT) that exploits the tractability of one-dimensional OT problems. By combining tools from OT, integral geometry, and computational statistics, SOT enables fast and scalable computation of distances, barycenters, and kernels for probability measures, while retaining rich geometric structure. This paper provides a comprehensive review of SOT, covering its mathematical foundations, methodological advances, computational methods, and applications. We discuss key concepts of OT and one-dimensional OT, the role of tools from integral geometry such as Radon transform in projecting measures, and statistical techniques for estimating sliced distances. The paper further explores recent methodological advances, including non-linear projections, improved Monte Carlo approximations, statistical estimation techniques for one-dimensional optimal transport, weighted slicing techniques, and transportation plan estimation methods. Variational problems, such as minimum sliced Wasserstein estimation, barycenters, gradient flows, kernel constructions, and embeddings are examined alongside extensions to unbalanced, partial, multi-marginal, and Gromov-Wasserstein settings. Applications span machine learning, statistics, computer graphics and computer visions, highlighting SOT's versatility as a practical computational tool. This work will be of interest to researchers and practitioners in machine learning, data sciences, and computational disciplines seeking efficient alternatives to classical OT.
comment: 259 pages
♻ ☆ Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval NeurIPS 2025
While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.
comment: NeurIPS 2025; 27 pages, 6 figures
♻ ☆ HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models
Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts. Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies-highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.
comment: Accepted by EMNLP 2025 Findings
♻ ☆ DFAMS: Dynamic-flow guided Federated Alignment based Multi-prototype Search
Federated Retrieval (FR) routes queries across multiple external knowledge sources, to mitigate hallucinations of LLMs, when necessary external knowledge is distributed. However, existing methods struggle to retrieve high-quality and relevant documents for ambiguous queries, especially in cross-domain scenarios, which significantly limits their effectiveness in supporting downstream generation tasks. Inspired by Dynamic Information Flow (DIF), we propose DFAMS, a novel framework that leverages DIF to identify latent query intents and construct semantically aligned knowledge partitions for accurate retrieval across heterogeneous sources. Specifically, DFAMS probes the DIF in LLMs by leveraging gradient signals from a few annotated queries and employing Shapley value-based attribution to trace neuron activation paths associated with intent recognition and subdomain boundary detection. Then, DFAMS leverages DIF to train an alignment module via multi-prototype contrastive learning, enabling fine-grained intra-source modeling and inter-source semantic alignment across knowledge bases. Experimental results across five benchmarks show that DFAMS outperforms advanced FR methods by up to 14.37\% in knowledge classification accuracy, 5.38\% in retrieval recall, and 6.45\% in downstream QA accuracy, demonstrating its effectiveness in complex FR scenarios. Our code are anonymous available at https://anonymous.4open.science/r/DFAMS/
comment: 8 pages, 3 figures
♻ ☆ Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak
Large Language Models (LLMs) have gained significant traction in various applications, yet their capabilities present risks for both constructive and malicious exploitation. Despite extensive training and fine-tuning efforts aimed at enhancing safety, LLMs remain susceptible to jailbreak attacks. Recently, the emergence of multi-turn attacks has intensified this vulnerability. Unlike single-turn attacks, multi-turn attacks incrementally escalate dialogue complexity, rendering them more challenging to detect and mitigate. In this study, we introduce G-Guard, an innovative attention-aware Graph Neural Network (GNN)-based input classifier specifically designed to defend against multi-turn jailbreak attacks targeting LLMs. G-Guard constructs an entity graph for multi-turn queries, which captures the interrelationships between queries and harmful keywords that present in multi-turn queries. Furthermore, we propose an attention-aware augmentation mechanism that retrieves the most relevant single-turn query based on the ongoing multi-turn conversation. The retrieved query is incorporated as a labeled node within the graph, thereby enhancing the GNN's capacity to classify the current query as harmful or benign. Evaluation results show that G-Guard consistently outperforms all baselines across diverse datasets and evaluation metrics, demonstrating its efficacy as a robust defense mechanism against multi-turn jailbreak attacks.
♻ ☆ WW-FL: Secure and Private Large-Scale Federated Learning
Federated learning (FL) is an efficient approach for large-scale distributed machine learning that promises data privacy by keeping training data on client devices. However, recent research has uncovered vulnerabilities in FL, impacting both security and privacy through poisoning attacks and the potential disclosure of sensitive information in individual model updates as well as the aggregated global model. This paper explores the inadequacies of existing FL protection measures when applied independently, and the challenges of creating effective compositions. Addressing these issues, we propose WW-FL, an innovative framework that combines secure multi-party computation (MPC) with hierarchical FL to guarantee data and global model privacy. One notable feature of WW-FL is its capability to prevent malicious clients from directly poisoning model parameters, confining them to less destructive data poisoning attacks. We furthermore provide a PyTorch-based FL implementation integrated with Meta's CrypTen MPC framework to systematically measure the performance and robustness of WW-FL. Our extensive evaluation demonstrates that WW-FL is a promising solution for secure and private large-scale federated learning.
comment: This is the full and extended version of the work, which will be published in the IACR Transactions on Cryptographic Hardware and Embedded Systems (CHES 2026)
♻ ☆ Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning ICML 2025
Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability -- often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLM's abilities to re-purpose parametric knowledge in novel settings.
comment: ICML 2025 Workshop on Scaling up Intervention Models
♻ ☆ Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critics role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.
♻ ☆ GraphRAG under Fire
GraphRAG advances retrieval-augmented generation (RAG) by structuring external knowledge as multi-scale knowledge graphs, enabling language models to integrate both broad context and granular details in their generation. While GraphRAG has demonstrated success across domains, its security implications remain largely unexplored. To bridge this gap, this work examines GraphRAG's vulnerability to poisoning attacks, uncovering an intriguing security paradox: existing RAG poisoning attacks are less effective under GraphRAG than conventional RAG, due to GraphRAG's graph-based indexing and retrieval; yet, the same features also create new attack surfaces. We present GragPoison, a novel attack that exploits shared relations in the underlying knowledge graph to craft poisoning text capable of compromising multiple queries simultaneously. GragPoison employs three key strategies: (i) relation injection to introduce false knowledge, (ii) relation enhancement to amplify poisoning influence, and (iii) narrative generation to embed malicious content within coherent text. Empirical evaluation across diverse datasets and models shows that GragPoison substantially outperforms existing attacks in terms of effectiveness (up to 98% success rate) and scalability (using less than 68% poisoning text) on multiple variations of GraphRAG. We also explore potential defensive measures and their limitations, identifying promising directions for future research.
comment: 13 pages. Accepted by IEEE Symposium on Security and Privacy 2026 (S&P 2026)
♻ ☆ How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies
Learning from Demonstration (LfD) algorithms have shown promising results in robotic manipulation tasks, but their vulnerability to offline universal perturbation attacks remains underexplored. This paper presents a comprehensive study of adversarial attacks on both classic and recently proposed algorithms, including Behavior Cloning (BC), LSTM-GMM, Implicit Behavior Cloning (IBC), Diffusion Policy (DP), and Vector-Quantizied Behavior Transformer (VQ-BET). We study the vulnerability of these methods to universal adversarial perturbations. Our experiments on several simulated robotic manipulation tasks reveal that most of the current methods are highly vulnerable to adversarial perturbations. We also show that these attacks are often transferable across algorithms, architectures, and tasks, raising concerning security vulnerabilities to black-box attacks. To the best of our knowledge, we are the first to present a systematic study of the vulnerabilities of different LfD algorithms to both white-box and black-box attacks. Our findings highlight the vulnerabilities of modern BC algorithms, paving the way for future work in addressing such limitations.
♻ ☆ Retrieval-Augmented Generation with Estimation of Source Reliability
Retrieval-Augmented Generation (RAG) is an effective approach to enhance the factual accuracy of large language models (LLMs) by retrieving information from external databases, which are typically composed of diverse sources, to supplement the limited internal knowledge of LLMs. However, the standard RAG often risks retrieving incorrect information, as it relies solely on relevance between a query and a document, overlooking the heterogeneous reliability of these sources. To address this issue, we propose Reliability-Aware RAG (RA-RAG), a new multi-source RAG framework that estimates the reliability of sources and leverages this information to prioritize highly reliable and relevant documents, ensuring more robust and accurate response generation. Specifically, RA-RAG first estimates source reliability by cross-checking information across multiple sources. It then retrieves documents from the top-$\kappa$ reliable and relevant sources and aggregates their information using weighted majority voting (WMV), where the selective retrieval ensures scalability while not compromising the performance. Comprehensive experiments show that RA-RAG consistently outperforms baselines in scenarios with heterogeneous source reliability while scaling efficiently as the number of sources increases. Furthermore, we demonstrate the ability of RA-RAG to estimate real-world sources' reliability, highlighting its practical applicability. \jy{Our code and data are available at \href{https://github.com/ml-postech/RA-RAG}{RA-RAG}.}
♻ ☆ Uncertainty-Supervised Interpretable and Robust Evidential Segmentation
Uncertainty estimation has been widely studied in medical image segmentation as a tool to provide reliability, particularly in deep learning approaches. However, previous methods generally lack effective supervision in uncertainty estimation, leading to low interpretability and robustness of the predictions. In this work, we propose a self-supervised approach to guide the learning of uncertainty. Specifically, we introduce three principles about the relationships between the uncertainty and the image gradients around boundaries and noise. Based on these principles, two uncertainty supervision losses are designed. These losses enhance the alignment between model predictions and human interpretation. Accordingly, we introduce novel quantitative metrics for evaluating the interpretability and robustness of uncertainty. Experimental results demonstrate that compared to state-of-the-art approaches, the proposed method can achieve competitive segmentation performance and superior results in out-of-distribution (OOD) scenarios while significantly improving the interpretability and robustness of uncertainty estimation. Code is available via https://github.com/suiannaius/SURE.
♻ ☆ A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges
With the global population increasing and arable land resources becoming increasingly limited, smart and precision agriculture have emerged as essential directions for sustainable agricultural development. Artificial intelligence (AI), particularly deep learning models, has been widely adopted in applications such as crop monitoring, pest detection, and yield prediction. Among recent generative models, diffusion models have demonstrated considerable potential in agricultural image processing, data augmentation, and remote sensing analysis. Compared to traditional generative adversarial networks (GANs), diffusion models exhibit greater training stability and superior image generation quality, effectively addressing challenges such as limited annotated datasets and imbalanced sample distributions in agricultural scenarios. This paper reviews recent advancements in the application of diffusion models within agriculture, focusing on their roles in crop disease and pest detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Diffusion models have been found useful in improving tasks like image generation, denoising, and data augmentation in agriculture, especially when environmental noise or variability is present. While their high computational requirements and limited generalizability across domains remain concerns, the approach is gradually proving effective in real-world applications such as precision crop monitoring. As research progresses, these models may help support sustainable agriculture and address emerging challenges in food systems.
♻ ☆ Attention as an Adaptive Filter
We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By assuming a continuous-time linear time-invariant system with simultaneously-diagonalizable state matrices and noise covariances, we can make use of a closed-form solution of the differential Lyapunov equation to efficiently propagate uncertainties through the dynamics from keys to queries. A generalization of attention naturally arises as the maximum likelihood solution for filtering the trajectory of this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated query-key precisions. We further constrain the system dynamics and noise in order to obtain a simplified variant with the same computational and memory complexity as standard attention. In the limit of zero decay and process noise, and using a small-angle approximation, we recover a complex-valued generalization of ordinary dot-product attention with rotary positional encodings.
♻ ☆ Quantum Circuit Synthesis and Compilation Optimization: Overview and Prospects
Quantum computing is a promising paradigm that may overcome the current computational power bottlenecks. The increasing maturity of quantum processors provides more possibilities for the development and implementation of quantum algorithms. As the crucial stages for quantum algorithm implementation, the logic circuit design and quantum compiling have also received significant attention, which covers key technologies, e.g., quantum logic circuit synthesis (also widely known as quantum architecture search) and optimization, as well as qubit mapping and routing. Recent studies suggest that the scale and precision of related algorithms are steadily increasing, especially with the integration of artificial intelligence methods. In this survey, we systematically review and summarize a vast body of literature, exploring the feasibility of an integrated design and optimization scheme that spans from the algorithmic level to quantum hardware, combining the steps of logic circuit design and compilation optimization. Leveraging the exceptional cognitive and learning capabilities of AI algorithms, it becomes more possible to reduce manual design costs, enhance the precision and efficiency of execution, and facilitate the implementation and validation of the superiority of quantum algorithms on hardware.
♻ ☆ The Open Source Advantage in Large Language Models (LLMs)
Large language models (LLMs) have rapidly advanced natural language processing, driving significant breakthroughs in tasks such as text generation, machine translation, and domain-specific reasoning. The field now faces a critical dilemma in its approach: closed-source models like GPT-4 deliver state-of-the-art performance but restrict reproducibility, accessibility, and external oversight, while open-source frameworks like LLaMA and Mixtral democratize access, foster collaboration, and support diverse applications, achieving competitive results through techniques like instruction tuning and LoRA. Hybrid approaches address challenges like bias mitigation and resource accessibility by combining the scalability of closed-source systems with the transparency and inclusivity of open-source framework. However, in this position paper, we argue that open-source remains the most robust path for advancing LLM research and ethical deployment.
comment: 9 pages, 1 figure
♻ ☆ CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
comment: 16 pages, 9 figures, 7 tables
♻ ☆ Enhancing the reliability of machine learning for gravitational wave parameter estimation with attention-based models
We introduce a technique to enhance the reliability of gravitational wave parameter estimation results produced by machine learning. We develop two independent machine learning models based on the Vision Transformer to estimate effective spin and chirp mass from spectrograms of gravitational wave signals from binary black hole mergers. To enhance the reliability of these models, we utilize attention maps to visualize the areas our models focus on when making predictions. This approach enables demonstrating that both models perform parameter estimation based on physically meaningful information. Furthermore, by leveraging these attention maps, we demonstrate a method to quantify the impact of glitches on parameter estimation. We show that as the models focus more on glitches, the parameter estimation results become more strongly biased. This suggests that attention maps could potentially be used to distinguish between cases where the results produced by the machine learning model are reliable and cases where they are not.
comment: 9 pages, 14 figures
♻ ☆ Toward Fair Graph Neural Networks Via Dual-Teacher Knowledge Distillation
Graph Neural Networks (GNNs) have demonstrated strong performance in graph representation learning across various real-world applications. However, they often produce biased predictions caused by sensitive attributes, such as religion or gender, an issue that has been largely overlooked in existing methods. Recently, numerous studies have focused on reducing biases in GNNs. However, these approaches often rely on training with partial data (e.g., using either node features or graph structure alone), which can enhance fairness but frequently compromises model utility due to the limited utilization of available graph information. To address this tradeoff, we propose an effective strategy to balance fairness and utility in knowledge distillation. Specifically, we introduce FairDTD, a novel Fair representation learning framework built on Dual-Teacher Distillation, leveraging a causal graph model to guide and optimize the design of the distillation process. Specifically, FairDTD employs two fairness-oriented teacher models: a feature teacher and a structure teacher, to facilitate dual distillation, with the student model learning fairness knowledge from the teachers while also leveraging full data to mitigate utility loss. To enhance information transfer, we incorporate graph-level distillation to provide an indirect supplement of graph information during training, as well as a node-specific temperature module to improve the comprehensive transfer of fair knowledge. Experiments on diverse benchmark datasets demonstrate that FairDTD achieves optimal fairness while preserving high model utility, showcasing its effectiveness in fair representation learning for GNNs.
♻ ☆ Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures
As large language models (LLMs) take on increasingly complex tasks, understanding their algorithmic reasoning abilities has become essential. However, existing evaluations focus on distinct and isolated tasks. We propose a unified diagnostic lens: structural reasoning--understanding and manipulating relationships like order, hierarchy, and connectivity. We introduce DSR-Bench, the first benchmark to systematically evaluate LLM structural reasoning through canonical data structures, which serve as interpretable, algorithmically meaningful abstractions. DSR-Bench spans 20 data structures, 35 operations, and 4,140 synthetically generated problem instances with minimal contamination. The benchmark's hierarchical design pinpoints specific failure modes, while its fully automated evaluation ensures objective and consistent assessment. Benchmarking ten state-of-the-art LLMs reveals critical limitations: the top-performing model scores only 0.498 out of 1 on challenging instances. Three additional evaluation suites reveal further weaknesses: models perform poorly on spatial data and natural language scenarios, and fail to reason over their own generated code. DSR-Bench offers a principled diagnostic tool for structural reasoning, helping expose reasoning bottlenecks and guide the development of more capable and reliable LLMs.
♻ ☆ DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting
Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer's decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC.
♻ ☆ Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets
Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate whether example difficulty affects GRPO training effectiveness by comparing selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Training on the hardest 10\% of examples (those where the base model fails most often) yields dramatic performance gains up to 47\%, while easy examples produce minimal improvements of 3-15\%. This occurs because GRPO requires outcome variance to generate learning signals; hard examples maintain mixed success/failure outcomes throughout training while easy examples quickly converge to consistent success, eliminating learning opportunities. Moreover, models trained on hard examples show superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on the AIME2025 benchmark. Our findings provide clear guidance: when budget-constrained, prioritize collecting and annotating examples where your base model struggles, as these drive nearly all learning value in GRPO fine-tuning
♻ ☆ IBCL: Zero-shot Model Generation under Stability-Plasticity Trade-offs
Algorithms that balance the stability-plasticity trade-off are well studied in the Continual Learning literature. However, only a few focus on obtaining models for specified trade-off preferences. When solving the problem of continual learning under specific trade-offs (CLuST), state-of-the-art techniques leverage rehearsal-based learning, which requires retraining when a model corresponding to a new trade-off preference is requested. This is inefficient, since there potentially exists a significant number of different trade-offs, and a large number of models may be requested. As a response, we propose Imprecise Bayesian Continual Learning (IBCL), an algorithm that tackles CLuST efficiently. IBCL replaces retraining with a constant-time convex combination. Given a new task, IBCL (1) updates the knowledge base as a convex hull of model parameter distributions, and (2) generates one Pareto-optimal model per given trade-off via convex combination without additional training. That is, obtaining models corresponding to specified trade-offs via IBCL is zero-shot. Experiments whose baselines are current CLuST algorithms show that IBCL improves classification by at most 44% on average per task accuracy, and by 45% on peak per task accuracy while maintaining a near-zero to positive backward transfer, with memory overheads converging to constants. In addition, its training overhead, measured by the number of batch updates, remains constant at every task, regardless of the number of preferences requested. IBCL also improves multi-objective reinforcement learning tasks by maintaining the same Pareto front hypervolume, while significantly reducing the training cost. Details can be found at: https://github.com/ibcl-anon/ibcl.
comment: Preprint of our latest version
♻ ☆ Uncertainty Estimation on Graphs with Structure Informed Stochastic Partial Differential Equations
Graph Neural Networks have achieved impressive results across diverse network modeling tasks, but accurately estimating uncertainty on graphs remains difficult, especially under distributional shifts. Unlike traditional uncertainty estimation, graph-based uncertainty must account for randomness arising from both the graph's structure and its label distribution, which adds complexity. In this paper, making an analogy between the evolution of a stochastic partial differential equation (SPDE) driven by Matern Gaussian Process and message passing using GNN layers, we present a principled way to design a novel message passing scheme that incorporates spatial-temporal noises motivated by the Gaussian Process approach to SPDE. Our method simultaneously captures uncertainty across space and time and allows explicit control over the covariance kernel smoothness, thereby enhancing uncertainty estimates on graphs with both low and high label informativeness. Our extensive experiments on Out-of-Distribution (OOD) detection on graph datasets with varying label informativeness demonstrate the soundness and superiority of our model to existing approaches.
♻ ☆ FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction
Per- and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are unfortunately persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies, robustness analysis, real-world validation, and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results and expert feedback highlight our framework's potential for scalable PFAS monitoring.
♻ ☆ Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation
We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
♻ ☆ Socially inspired Adaptive Coalition and Client Selection in Federated Learning
Federated Learning (FL) enables privacy-preserving collaborative model training, but its effectiveness is often limited by client data heterogeneity. We introduce a client-selection algorithm that (i) dynamically forms nonoverlapping coalitions of clients based on asymptotic agreement and (ii) selects one representative from each coalition to minimize the variance of model updates. Our approach is inspired by social-network modeling, leveraging homophily-based proximity matrices for spectral clustering and techniques for identifying the most informative individuals to estimate a group's aggregate opinion. We provide theoretical convergence guarantees for the algorithm under mild, standard FL assumptions. Finally, we validate our approach by benchmarking it against three strong heterogeneity-aware baselines; the results show higher accuracy and faster convergence, indicating that the framework is both theoretically grounded and effective in practice.
♻ ☆ Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection
With the deployment of Large Language Models (LLMs) in interactive applications, online malicious intent detection has become increasingly critical. However, existing approaches fall short of handling diverse and complex user queries in real time. To address these challenges, we introduce ADRAG (Adversarial Distilled Retrieval-Augmented Guard), a two-stage framework for robust and efficient online malicious intent detection. In the training stage, a high-capacity teacher model is trained on adversarially perturbed, retrieval-augmented inputs to learn robust decision boundaries over diverse and complex user queries. In the inference stage, a distillation scheduler transfers the teacher's knowledge into a compact student model, with a continually updated knowledge base collected online. At deployment, the compact student model leverages top-K similar safety exemplars retrieved from the online-updated knowledge base to enable both online and real-time malicious query detection. Evaluations across ten safety benchmarks demonstrate that ADRAG, with a 149M-parameter model, achieves 98.5% of WildGuard-7B's performance, surpasses GPT-4 by 3.3% and Llama-Guard-3-8B by 9.5% on out-of-distribution detection, while simultaneously delivering up to 5.6x lower latency at 300 queries per second (QPS) in real-time applications.
♻ ☆ Estimating the Hallucination Rate of Generative AI
This paper presents a method for estimating the hallucination rate for in-context learning (ICL) with generative AI. In ICL, a conditional generative model (CGM) is prompted with a dataset and a prediction question and asked to generate a response. One interpretation of ICL assumes that the CGM computes the posterior predictive of an unknown Bayesian model, which implicitly defines a joint distribution over observable datasets and latent mechanisms. This joint distribution factorizes into two components: the model prior over mechanisms and the model likelihood of datasets given a mechanism. With this perspective, we define a hallucination as a generated response to the prediction question with low model likelihood given the mechanism. We develop a new method that takes an ICL problem and estimates the probability that a CGM will generate a hallucination. Our method only requires generating prediction questions and responses from the CGM and evaluating its response log probability. We empirically evaluate our method using large language models for synthetic regression and natural language ICL tasks.
♻ ☆ The Value of AI Advice: Personalized and Value-Maximizing AI Advisors Are Necessary to Reliably Benefit Experts and Organizations
Despite advances in AI's performance and interpretability, AI advisors can undermine experts' decisions and increase the time and effort experts must invest to make decisions. Consequently, AI systems deployed in high-stakes settings often fail to consistently add value across experts and organizations and can even diminish the value that experts alone provide. Beyond harm in specific domains, such outcomes impede progress in research and practice, underscoring the need to understand when and why different AI advisors add or diminish value. To bridge this gap, we stress the importance of assessing the value AI advice brings to real-world contexts when designing and evaluating AI advisors. Building on this perspective, we characterize key pillars -- pathways through which AI advice impacts value -- and develop a framework that incorporates these pillars to create reliable, personalized, and value-adding advisors. Our results highlight the need for value-driven development of AI advisors that advise selectively, are tailored to experts' unique behaviors, and are optimized for context-specific trade-offs between decision improvements and advising costs. They also reveal how the lack of inclusion of these pillars in the design of AI advising systems may be contributing to the failures observed in practical applications.
♻ ☆ On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse
Specifying all desirable properties of a language model is challenging, but certain requirements seem essential. Given samples from an unknown language, the trained model should produce valid strings not seen in training and be expressive enough to capture the language's full richness. Otherwise, outputting invalid strings constitutes "hallucination," and failing to capture the full range leads to "mode collapse." We ask if a language model can meet both requirements. We investigate this within a statistical language generation setting building on Gold and Angluin. Here, the model receives random samples from a distribution over an unknown language K, which belongs to a possibly infinite collection of languages. The goal is to generate unseen strings from K. We say the model generates from K with consistency and breadth if, as training size increases, its output converges to all unseen strings in K. Kleinberg and Mullainathan [KM24] asked if consistency and breadth in language generation are possible. We answer this negatively: for a large class of language models, including next-token prediction models, this is impossible for most collections of candidate languages. This contrasts with [KM24]'s result, showing consistent generation without breadth is possible for any countable collection of languages. Our finding highlights that generation with breadth fundamentally differs from generation without breadth. As a byproduct, we establish near-tight bounds on the number of samples needed for generation with or without breadth. Finally, our results offer hope: consistent generation with breadth is achievable for any countable collection of languages when negative examples (strings outside K) are available alongside positive ones. This suggests that post-training feedback, which encodes negative examples, can be crucial in reducing hallucinations while limiting mode collapse.
comment: Accepted for presentation at the 57th Symposium on Theory of Computing (STOC 2025); v3 fixes typos
♻ ☆ LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this consistency in the generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the *Branching Factor* (BF)--a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this consistency has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., ``Sure'') that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
comment: Codebase: https://github.com/yangalan123/LLMBranchingFactor. V2: Rewrite the theory part for a broader audience. Add experiments to verify the necessity of the AEP estimator. Generalize findings to multilingual tasks and Qwen models. Add discussions on practical implications, and on which alignment stage contributes most to BF reduction. Add ethical statements connecting pluralistic alignment
♻ ☆ Large Language Models Inference Engines based on Spiking Neural Networks
Foundational models based on the transformer architecture are currently the state-of-the-art in general language modeling, as well as in scientific areas such as material science and climate. However, training and deploying these models is computationally challenging as the time and space complexity has a quadratic relation to the input sequence length. Several efforts exploring efficient computational paradigms and model architectures to address these limitations have been made. In this work, we explore spiking neural networks (SNNs) to design transformer models. A challenge in training large-scale SNNs, using existing surrogate learning methods is inefficient and time-consuming. On the other hand, techniques to convert existing transformer-based models to their SNN equivalent are not scalable, as achieving optimal performance comes at the cost of a large number of spike time-steps, i.e. increased latency. To address this, we propose NeurTransformer, a methodology for designing transformer-based SNN for inference using a supervised fine-tuning approach with existing conversion methods. The proposed methodology works by: (1) replacing the self-attention mechanism with a spike-based self-attention (SSA), (2) converting the feed-forward block of the trained transformer model to its equivalent SNN, and (3) fine-tuning the SSA block using SNN-based surrogate learning algorithms. We benchmark the proposed methodology and demonstrate its accuracy and scalability using three variants of the GPT-2 model of increasing model size. We observe that the converted GPT-2 small models demonstrate a 5-12% loss in cosine similarity and a 9.7% reduction in perplexity. Finally, we demonstrate the energy efficiency of the SSA block compared to the ASA block and show between 64.71% and 85.28% reductions in estimated energy consumption when implementing the self-attention mechanism on a digital hardware.
♻ ☆ How Well Can Preference Optimization Generalize Under Noisy Feedback?
As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for aligning LLMs. However, most existing works assume noise-free feedback, which is unrealistic due to the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. In particular, we consider noise models that correspond to common real-world sources of noise, such as mislabeling and uncertainty. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We describe how generalization decays with different types of noise across levels of noise rates based on the preference data distribution and number of samples. Our analysis for noisy preference learning applies to a broad family of preference optimization losses such as DPO, IPO, SLiC, etc. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.
♻ ☆ Can DPO Learn Diverse Human Values? A Theoretical Scaling Law NeurIPS 2025
Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. An essential part of ensuring that LLMs are aligned for all people is accounting for a diverse set of values. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we provide a bound on the generalization error that demonstrates the challenges of effectively learning a wide set of concepts or values. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.
comment: NeurIPS 2025
♻ ☆ Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon
We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs -- a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. On two natural settings (1) generalization gap for flat solutions, and (2) mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between the flat solutions vis-\`a-vis low-norm solutions (i.e., weight decay), which knowingly do not suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of ''neural shattering'' where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.
comment: Accepted by Neurips 2025 (Spotlight)
♻ ☆ On the Capacity of Self-Attention
While self-attention is known to learn relations among tokens, we lack a formal understanding of its capacity: how many distinct relations can a single layer reliably recover for a given budget? To formalize this, we introduce Relational Graph Recognition (RGR), where the key-query channel represents a graph on $m$ items with $m'$ directed edges, and, given a context of items, must recover the neighbors of each item. We measure resources by the total key dimension $D_K = h\,d_k$. Within this framework, we analytically derive a capacity scaling law and validate it empirically. We show that $D_K = \Theta(m' \log m' / d_{\text{model}})$ is both necessary (information-theoretic lower bound) and sufficient (explicit construction) in a broad class of graphs to recover $m'$ relations. This scaling law directly leads to a new, capacity-based rationale for multi-head attention that applies even when each item only attends to a single target. When embeddings are uncompressed ($m = d_{\text{model}}$) and the graph is a permutation, a single head suffices. However, compression ($m > d_{\text{model}}$) forces relations into overlapping subspaces, creating interference that a single large head cannot disentangle. Our analysis shows that allocating a fixed $D_K$ across many small heads mitigates this interference, increasing the number of recoverable relations. Controlled single-layer experiments mirror the theory, revealing a sharp performance threshold that matches the predicted capacity scaling and confirms the benefit of distributing $D_K$ across multiple heads. Altogether, these results provide a concrete scaling law for self-attention capacity and a principled design rule for allocating key-query budget across heads.
♻ ☆ Decentralizing Multi-Agent Reinforcement Learning with Temporal Causal Information
Reinforcement learning (RL) algorithms can find an optimal policy for a single agent to accomplish a particular task. However, many real-world problems require multiple agents to collaborate in order to achieve a common goal. For example, a robot executing a task in a warehouse may require the assistance of a drone to retrieve items from high shelves. In Decentralized Multi-Agent RL (DMARL), agents learn independently and then combine their policies at execution time, but often must satisfy constraints on compatibility of local policies to ensure that they can achieve the global task when combined. In this paper, we study how providing high-level symbolic knowledge to agents can help address unique challenges of this setting, such as privacy constraints, communication limitations, and performance concerns. In particular, we extend the formal tools used to check the compatibility of local policies with the team task, making decentralized training with theoretical guarantees usable in more scenarios. Furthermore, we empirically demonstrate that symbolic knowledge about the temporal evolution of events in the environment can significantly expedite the learning process in DMARL.
comment: Code available at https://github.com/corazza/tcdmarl
♻ ☆ OrbitZoo: Real Orbital Systems Challenges for Reinforcement Learning NeurIPS 2025
The increasing number of satellites and orbital debris has made space congestion a critical issue, threatening satellite safety and sustainability. Challenges such as collision avoidance, station-keeping, and orbital maneuvering require advanced techniques to handle dynamic uncertainties and multi-agent interactions. Reinforcement learning (RL) has shown promise in this domain, enabling adaptive, autonomous policies for space operations; however, many existing RL frameworks rely on custom-built environments developed from scratch, which often use simplified models and require significant time to implement and validate the orbital dynamics, limiting their ability to fully capture real-world complexities. To address this, we introduce OrbitZoo, a versatile multi-agent RL environment built on a high-fidelity industry standard library, that enables realistic data generation, supports scenarios like collision avoidance and cooperative maneuvers, and ensures robust and accurate orbital dynamics. The environment is validated against a real satellite constellation, Starlink, achieving a Mean Absolute Percentage Error (MAPE) of 0.16% compared to real-world data. This validation ensures reliability for generating high-fidelity simulations and enabling autonomous and independent satellite operations.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLM's ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g. collecting data to test a scientific theory) and model discovery (e.g. proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain their model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery. We find that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.
comment: KG and MYL contributed equally
♻ ☆ Functional Critic Modeling for Provably Convergent Off-Policy Actor-Critic
Off-policy reinforcement learning (RL) with function approximation offers an effective way to improve sample efficiency by reusing past experience. Within this setting, the actor-critic (AC) framework has achieved strong empirical success. However, both the critic and actor learning is challenging for the off-policy AC methods: first of all, in addition to the classic "deadly triad" instability of off-policy evaluation, it also suffers from a "moving target" problem, where the policy being evaluated changes continually; secondly, actor learning becomes less efficient due to the difficulty of estimating the exact off-policy policy gradient. The first challenge essentially reduces the problem to repeatedly performing off-policy evaluation for changing policies. For the second challenge, the off-policy policy gradient theorem requires a complex and often impractical algorithm to estimate an additional emphasis critic, which is typically neglected in practice, thereby reducing to the on-policy policy gradient as an approximation. In this work, we introduce a novel concept of functional critic modeling, which leads to a new AC framework that addresses both challenges for actor-critic learning under the deadly triad setting. We provide a theoretical analysis in the linear function setting, establishing the provable convergence of our framework, which, to the best of our knowledge, is the first convergent off-policy target-based AC algorithm. From a practical perspective, we further propose a carefully designed neural network architecture for the functional critic modeling and demonstrate its effectiveness through preliminary experiments on widely used RL tasks from the DeepMind Control Benchmark.
♻ ☆ Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation
We propose and study Sparse Polyak, a variant of Polyak's adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision-even when, the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.
♻ ☆ Joint Denoising of Cryo-EM Projection Images using Polar Transformers
Many imaging modalities involve reconstruction of unknown objects from collections of noisy projections related by random rotations. In one of these modalities, cryogenic electron microscopy (cryo-EM), the extremely low signal-to-noise ratio (SNR) makes integration of information from multiple images crucial. Existing approaches to cryo-EM processing, however, either rely on handcrafted priors or apply deep learning only on select portions of the pipeline, such as particle picking, micrograph denoising, or refinement. A fully end-to-end reconstruction approach requires a neural network architecture that integrates information from multiple images while respecting the rotational symmetry of the measurement process. In this work, we introduce the polar transformer, a new neural network architecture that combines polar representations and transformers along with a convolutional attention mechanism that preserves the rotational symmetry of the problem. We apply it to the particle-level denoising problem, where it is able to learn discriminative features in the images, enabling optimal clustering, alignment, and denoising. On simulated datasets, this achieves up to a $2\times$ reduction in mean squared error (MSE) at a signal-to-noise ratio (SNR) of $0.02$, suggesting new opportunities for data-driven approaches to reconstruction in cryo-EM and related tomographic modalities.
♻ ☆ NewVEM: A Newton Vertex Exchange Method for a Class of Constrained Self-Concordant Minimization Problems
We propose \textbf{NewVEM}, a Newton vertex exchange method for efficiently solving self-concordant minimization problems under generalized simplex constraints. The algorithm features a two-level structure: the outer loop employs a projected Newton method, and the inner loop uses a vertex exchange approach to solve strongly convex quadratic subproblems. Both loops converge locally at linear rates under technical conditions, resulting in a ``fast $\times$ fast'' framework that demonstrates high efficiency and scalability in practice. To get a feasible initial point to execute the algorithm, we also present and analyze a highly efficient semismooth Newton method for computing the projection onto the generalized simplex. The excellent practical performance of the proposed algorithms is demonstrated by a set of numerical experiments. Our results further motivate the potential real-world applications of the considered model and the proposed algorithms.
comment: 32 pages, 8 tables
♻ ☆ SoundnessBench: A Soundness Benchmark for Neural Network Verifiers
Neural network (NN) verification aims to formally verify properties of NNs, which is crucial for ensuring the behavior of NN-based models in safety-critical applications. In recent years, the community has developed many NN verifiers and benchmarks to evaluate them. However, existing benchmarks typically lack ground-truth for hard instances where no current verifier can verify the property and no counterexample can be found. This makes it difficult to validate the soundness of a verifier, when it claims verification on such challenging instances that no other verifier can handle. In this work, we develop a new benchmark for NN verification, named "SoundnessBench", specifically for testing the soundness of NN verifiers. SoundnessBench consists of instances with deliberately inserted counterexamples that are hidden from adversarial attacks commonly used to find counterexamples. Thereby, it can identify false verification claims when hidden counterexamples are known to exist. We design a training method to produce NNs with hidden counterexamples and systematically construct our SoundnessBench with instances across various model architectures, activation functions, and input data. We demonstrate that our training effectively produces hidden counterexamples and our SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers. Our code is available at https://github.com/MVP-Harry/SoundnessBench and our benchmark is available at https://huggingface.co/datasets/SoundnessBench/SoundnessBench.
comment: Preprint
♻ ☆ Rademacher Meets Colors: More Expressivity, but at What Cost ?
The expressive power of graph neural networks (GNNs) is typically understood through their correspondence with graph isomorphism tests such as the Weisfeiler-Leman (WL) hierarchy. While more expressive GNNs can distinguish a richer set of graphs, they are also observed to suffer from higher generalization error. This work provides a theoretical explanation for this trade-off by linking expressivity and generalization through the lens of coloring algorithms. Specifically, we show that the number of equivalence classes induced by WL colorings directly bounds the GNNs Rademacher complexity -- a key data-dependent measure of generalization. Our analysis reveals that greater expressivity leads to higher complexity and thus weaker generalization guarantees. Furthermore, we prove that the Rademacher complexity is stable under perturbations in the color counts across different samples, ensuring robustness to sampling variability across datasets. Importantly, our framework is not restricted to message-passing GNNs or 1-WL, but extends to arbitrary GNN architectures and expressivity measures that partition graphs into equivalence classes. These results unify the study of expressivity and generalization in GNNs, providing a principled understanding of why increasing expressive power often comes at the cost of generalization.
♻ ☆ Bayesian Double Descent
Double descent is a phenomenon of over-parameterized statistical models such as deep neural networks which have a re-descending property in their risk function. As the complexity of the model increases, risk exhibits a U-shaped region due to the traditional bias-variance trade-off, then as the number of parameters equals the number of observations and the model becomes one of interpolation where the risk can be unbounded and finally, in the over-parameterized region, it re-descends -- the double descent effect. Our goal is to show that this has a natural Bayesian interpretation. We also show that this is not in conflict with the traditional Occam's razor -- simpler models are preferred to complex ones, all else being equal. Our theoretical foundations use Bayesian model selection, the Dickey-Savage density ratio, and connect generalized ridge regression and global-local shrinkage methods with double descent. We illustrate our approach for high dimensional neural networks and provide detailed treatments of infinite Gaussian means models and non-parametric regression. Finally, we conclude with directions for future research.
♻ ☆ HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model
Uncertainty quantification is critical for ensuring robustness in high-stakes machine learning applications. We introduce HybridFlow, a modular hybrid architecture that unifies the modeling of aleatoric and epistemic uncertainty by combining a Conditional Masked Autoregressive normalizing flow for estimating aleatoric uncertainty with a flexible probabilistic predictor for epistemic uncertainty. The framework supports integration with any probabilistic model class, allowing users to easily adapt HybridFlow to existing architectures without sacrificing predictive performance. HybridFlow improves upon previous uncertainty quantification frameworks across a range of regression tasks, such as depth estimation, a collection of regression benchmarks, and a scientific case study of ice sheet emulation. We also provide empirical results of the quantified uncertainty, showing that the uncertainty quantified by HybridFlow is calibrated and better aligns with model error than existing methods for quantifying aleatoric and epistemic uncertainty. HybridFlow addresses a key challenge in Bayesian deep learning, unifying aleatoric and epistemic uncertainty modeling in a single robust framework.
comment: Reviewed and published in TMLR at https://openreview.net/forum?id=xRiEdSyVjY
♻ ☆ When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning NeurIPS 2022
Learning effective reinforcement learning (RL) policies to solve real-world complex tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility to learn policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline data with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.
comment: NeurIPS 2022 Spotlight
♻ ☆ Early Signs of Steganographic Capabilities in Frontier LLMs
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances like using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as Hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.
♻ ☆ Learning Equivariant Functions via Quadratic Forms
In this study, we introduce a method for learning group (known or unknown) equivariant functions by learning the associated quadratic form $x^T A x$ corresponding to the group from the data. Certain groups, known as orthogonal groups, preserve a specific quadratic form, and we leverage this property to uncover the underlying symmetry group under the assumption that it is orthogonal. By utilizing the corresponding unique symmetric matrix and its inherent diagonal form, we incorporate suitable inductive biases into the neural network architecture, leading to models that are both simplified and efficient. Our approach results in an invariant model that preserves norms, while the equivariant model is represented as a product of a norm-invariant model and a scale-invariant model, where the ``product'' refers to the group action. Moreover, we extend our framework to a more general setting where the function acts on tuples of input vectors via a diagonal (or product) group action. In this extension, the equivariant function is decomposed into an angular component extracted solely from the normalized first vector and a scale-invariant component that depends on the full Gram matrix of the tuple. This decomposition captures the inter-dependencies between multiple inputs while preserving the underlying group symmetry. We assess the effectiveness of our framework across multiple tasks, including polynomial regression, top quark tagging, and moment of inertia matrix prediction. Comparative analysis with baseline methods demonstrates that our model consistently excels in both discovering the underlying symmetry and efficiently learning the corresponding equivariant function.
♻ ☆ A Survey of Graph Unlearning
Graph unlearning emerges as a crucial advancement in the pursuit of responsible AI, providing the means to remove sensitive data traces from trained models, thereby upholding the \textit{right to be forgotten}. It is evident that graph machine learning exhibits sensitivity to data privacy and adversarial attacks, necessitating the application of graph unlearning techniques to address these concerns effectively. In this comprehensive survey paper, we present the first systematic review of graph unlearning approaches, encompassing a diverse array of methodologies and offering a detailed taxonomy and up-to-date literature overview to facilitate the understanding of researchers new to this field. To ensure clarity, we provide lucid explanations of the fundamental concepts and evaluation measures used in graph unlearning, catering to a broader audience with varying levels of expertise. Delving into potential applications, we explore the versatility of graph unlearning across various domains, including but not limited to social networks, adversarial settings, recommender systems, and resource-constrained environments like the Internet of Things, illustrating its potential impact in safeguarding data privacy and enhancing AI systems' robustness. Finally, we shed light on promising research directions, encouraging further progress and innovation within the domain of graph unlearning. By laying a solid foundation and fostering continued progress, this survey seeks to inspire researchers to further advance the field of graph unlearning, thereby instilling confidence in the ethical growth of AI systems and reinforcing the responsible application of machine learning techniques in various domains.
comment: 15 page review paper on graph unlearning
♻ ☆ Orthogonal Finetuning Made Scalable
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
comment: EMNLP 2025 Main (18 pages, 7 figures, project page: https://spherelab.ai/oftv2/)
♻ ☆ Temporal-Difference Variational Continual Learning NeurIPS 2025
Machine Learning models in real-world applications must continuously learn new tasks to adapt to shifts in the data-generating distribution. Yet, for Continual Learning (CL), models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. In the Bayesian CL literature, variational methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution while constraining it to stay close to its previous estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. Experiments on challenging CL benchmarks show that our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.
comment: Published at NeurIPS 2025
♻ ☆ D2 Actor Critic: Diffusion Actor Meets Distributional Critic
We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.
♻ ☆ Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding
Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.
♻ ☆ Of Mice and Machines: A Comparison of Learning Between Real World Mice and RL Agents ICML 2025
Recent advances in reinforcement learning (RL) have demonstrated impressive capabilities in complex decision-making tasks. This progress raises a natural question: how do these artificial systems compare to biological agents, which have been shaped by millions of years of evolution? To help answer this question, we undertake a comparative study of biological mice and RL agents in a predator-avoidance maze environment. Through this analysis, we identify a striking disparity: RL agents consistently demonstrate a lack of self-preservation instinct, readily risking ``death'' for marginal efficiency gains. These risk-taking strategies are in contrast to biological agents, which exhibit sophisticated risk-assessment and avoidance behaviors. Towards bridging this gap between the biological and artificial, we propose two novel mechanisms that encourage more naturalistic risk-avoidance behaviors in RL agents. Our approach leads to the emergence of naturalistic behaviors, including strategic environment assessment, cautious path planning, and predator avoidance patterns that closely mirror those observed in biological systems.
comment: 19 pages, ICML 2025
♻ ☆ Joint Denoising of Cryo-EM Projection Images using Polar Transformers
Many imaging modalities involve reconstruction of unknown objects from collections of noisy projections related by random rotations. In one of these modalities, cryogenic electron microscopy (cryo-EM), the extremely low signal-to-noise ratio (SNR) makes integration of information from multiple images crucial. Existing approaches to cryo-EM processing, however, either rely on handcrafted priors or apply deep learning only on select portions of the pipeline, such as particle picking, micrograph denoising, or refinement. A fully end-to-end reconstruction approach requires a neural network architecture that integrates information from multiple images while respecting the rotational symmetry of the measurement process. In this work, we introduce the polar transformer, a new neural network architecture that combines polar representations and transformers along with a convolutional attention mechanism that preserves the rotational symmetry of the problem. We apply it to the particle-level denoising problem, where it is able to learn discriminative features in the images, enabling optimal clustering, alignment, and denoising. On simulated datasets, this achieves up to a $2\times$ reduction in mean squared error (MSE) at a signal-to-noise ratio (SNR) of $0.02$, suggesting new opportunities for data-driven approaches to reconstruction in cryo-EM and related tomographic modalities.
Computer Vision and Pattern Recognition 193
☆ Ev4DGS: Novel-view Rendering of Non-Rigid Objects from Monocular Event Streams
Event cameras offer various advantages for novel view rendering compared to synchronously operating RGB cameras, and efficient event-based techniques supporting rigid scenes have been recently demonstrated in the literature. In the case of non-rigid objects, however, existing approaches additionally require sparse RGB inputs, which can be a substantial practical limitation; it remains unknown if similar models could be learned from event streams only. This paper sheds light on this challenging open question and introduces Ev4DGS, i.e., the first approach for novel view rendering of non-rigidly deforming objects in the explicit observation space (i.e., as RGB or greyscale images) from monocular event streams. Our method regresses a deformable 3D Gaussian Splatting representation through 1) a loss relating the outputs of the estimated model with the 2D event observation space, and 2) a coarse 3D deformation model trained from binary masks generated from events. We perform experimental comparisons on existing synthetic and newly recorded real datasets with non-rigid objects. The results demonstrate the validity of Ev4DGS and its superior performance compared to multiple naive baselines that can be applied in our setting. We will release our models and the datasets used in the evaluation for research purposes; see the project webpage: https://4dqv.mpi-inf.mpg.de/Ev4DGS/.
☆ CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as drawing auxiliary lines or plotting functions to solve the problems. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the necessary precision and controllability for such tasks. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach leverages the VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as "visual thought", to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into codes. Finally, using these training data, we train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to 21% increase over base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, comprehensive benchmark, and strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
☆ Point Prompting: Counterfactual Tracking with Video Diffusion Models
Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point's trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models, we find that these "emergent" tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.
comment: Project link: https://point-prompting.github.io
☆ DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. For the issues of maintaining geometric fidelity and photorealism in generation quality, we attribute the main reason to the lack of large-scale, high-quality, real-world panoramic data, where such a data-centric view differs from prior methods that focus on model design. Basically, DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, which include circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.
comment: https://fenghora.github.io/DiT360-Page/
☆ Adversarial Attacks Leverage Interference Between Features in Superposition
Fundamental questions remain about when and why adversarial examples arise in neural networks, with competing views characterising them either as artifacts of the irregularities in the decision landscape or as products of sensitivity to non-robust input features. In this paper, we instead argue that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, we show how superposition - where networks represent more features than they have dimensions - creates arrangements of latent representations that adversaries can exploit. We demonstrate that adversarial perturbations leverage interference between superposed features, making attack patterns predictable from feature arrangements. Our framework provides a mechanistic explanation for two known phenomena: adversarial attack transferability between models with similar training regimes and class-specific vulnerability patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. We then demonstrate that these findings persist in a ViT trained on CIFAR-10. These findings reveal adversarial vulnerability can be a byproduct of networks' representational compression, rather than flaws in the learning process or non-robust inputs.
☆ Bayesian Topological Convolutional Neural Nets
Convolutional neural networks (CNNs) have been established as the main workhorse in image data processing; nonetheless, they require large amounts of data to train, often produce overconfident predictions, and frequently lack the ability to quantify the uncertainty of their predictions. To address these concerns, we propose a new Bayesian topological CNN that promotes a novel interplay between topology-aware learning and Bayesian sampling. Specifically, it utilizes information from important manifolds to accelerate training while reducing calibration error by placing prior distributions on network parameters and properly learning appropriate posteriors. One important contribution of our work is the inclusion of a consistency condition in the learning cost, which can effectively modify the prior distributions to improve the performance of our novel network architecture. We evaluate the model on benchmark image classification datasets and demonstrate its superiority over conventional CNNs, Bayesian neural networks (BNNs), and topological CNNs. In particular, we supply evidence that our method provides an advantage in situations where training data is limited or corrupted. Furthermore, we show that the new model allows for better uncertainty quantification than standard BNNs since it can more readily identify examples of out-of-distribution data on which it has not been trained. Our results highlight the potential of our novel hybrid approach for more efficient and robust image classification.
☆ QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
comment: Code is available at https://github.com/NVlabs/QeRL
☆ Scaling Language-Centric Omnimodal Representation Learning NeurIPS 2025
Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scales positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
comment: NeurIPS 2025
Diffusion Transformers with Representation Autoencoders
Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
comment: Technical Report; Project Page: https://rae-dit.github.io/
☆ Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View
Estimating an object's 6D pose, size, and shape from visual input is a fundamental problem in computer vision, with critical applications in robotic grasping and manipulation. Existing methods either rely on object-specific priors such as CAD models or templates, or suffer from limited generalization across categories due to pose-shape entanglement and multi-stage pipelines. In this work, we propose a unified, category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image, without requiring templates, CAD models, or category labels at test time. Our model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, and employs parallel decoders for pose-size estimation and shape reconstruction, achieving real-time inference at 28 FPS. Trained solely on synthetic data from 149 categories in the SOPE dataset, our framework is evaluated on four diverse benchmarks SOPE, ROPE, ObjaversePose, and HANDAL, spanning over 300 categories. It achieves state-of-the-art accuracy on seen categories while demonstrating remarkably strong zero-shot generalization to unseen real-world objects, establishing a new standard for open-set 6D understanding in robotics and embodied AI.
☆ FACE: Faithful Automatic Concept Extraction NeurIPS 2025
Interpreting deep neural networks through concept-based explanations offers a bridge between low-level features and high-level human-understandable semantics. However, existing automatic concept discovery methods often fail to align these extracted concepts with the model's true decision-making process, thereby compromising explanation faithfulness. In this work, we propose FACE (Faithful Automatic Concept Extraction), a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model's original and concept-based predictions. Unlike prior methods that operate solely on encoder activations, FACE incorporates classifier supervision during concept learning, enforcing predictive consistency and enabling faithful explanations. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ InfiniHuman: Infinite 3D Human Creation with Precise Control SIGGRAPH
Generating realistic and controllable 3D human avatars is a long-standing challenge, particularly when covering broad attribute ranges such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in scale and diversity. The central question we address in this paper is: Can existing foundation models be distilled to generate theoretically unbounded, richly annotated 3D human data? We introduce InfiniHuman, a framework that synergistically distills these models to produce richly annotated human data at minimal cost and with theoretically unlimited scalability. We propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. User study shows our automatically generated identities are undistinguishable from scan renderings. InfiniHumanData contains 111K identities spanning unprecedented diversity. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body-shape parameters. Building on this dataset, we propose InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate significant improvements over state-of-the-art methods in visual quality, generation speed, and controllability. Our approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution. We will publicly release the automatic data generation pipeline, the comprehensive InfiniHumanData dataset, and the InfiniHumanGen models at https://yuxuan-xue.com/infini-human.
comment: Accepted to ACM SIGGRAPH Asia 2025. Project website: https://yuxuan-xue.com/infini-human
☆ PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image
Reconstructing metrically accurate humans and their surrounding scenes from a single image is crucial for virtual reality, robotics, and comprehensive 3D scene understanding. However, existing methods struggle with depth ambiguity, occlusions, and physically inconsistent contacts. To address these challenges, we introduce PhySIC, a framework for physically plausible Human-Scene Interaction and Contact reconstruction. PhySIC recovers metrically consistent SMPL-X human meshes, dense scene surfaces, and vertex-level contact maps within a shared coordinate frame from a single RGB image. Starting from coarse monocular depth and body estimates, PhySIC performs occlusion-aware inpainting, fuses visible depth with unscaled geometry for a robust metric scaffold, and synthesizes missing support surfaces like floors. A confidence-weighted optimization refines body pose, camera parameters, and global scale by jointly enforcing depth alignment, contact priors, interpenetration avoidance, and 2D reprojection consistency. Explicit occlusion masking safeguards invisible regions against implausible configurations. PhySIC is efficient, requiring only 9 seconds for joint human-scene optimization and under 27 seconds end-to-end. It naturally handles multiple humans, enabling reconstruction of diverse interactions. Empirically, PhySIC outperforms single-image baselines, reducing mean per-vertex scene error from 641 mm to 227 mm, halving PA-MPJPE to 42 mm, and improving contact F1 from 0.09 to 0.51. Qualitative results show realistic foot-floor interactions, natural seating, and plausible reconstructions of heavily occluded furniture. By converting a single image into a physically plausible 3D human-scene pair, PhySIC advances scalable 3D scene understanding. Our implementation is publicly available at https://yuxuan-xue.com/physic.
comment: Accepted to ACM SIGGraphAsia 2025. Project website: https://yuxuan-xue.com/physic
☆ IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment
Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to support the evaluation of instruction-guided video editing adequately and further suffer from limited source diversity, narrow task coverage and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.
comment: Equal contributions from first two authors. Project page: https://ryanchenyn.github.io/projects/IVEBench Code: https://github.com/RyanChenYN/IVEBench Dataset: https://huggingface.co/datasets/Coraxor/IVEBench
☆ NV3D: Leveraging Spatial Shape Through Normal Vector-based 3D Object Detection
Recent studies in 3D object detection for autonomous vehicles aim to enrich features through the utilization of multi-modal setups or the extraction of local patterns within LiDAR point clouds. However, multi-modal methods face significant challenges in feature alignment, and gaining features locally can be oversimplified for complex 3D object detection tasks. In this paper, we propose a novel model, NV3D, which utilizes local features acquired from voxel neighbors, as normal vectors computed per voxel basis using K-nearest neighbors (KNN) and principal component analysis (PCA). This informative feature enables NV3D to determine the relationship between the surface and pertinent target entities, including cars, pedestrians, or cyclists. During the normal vector extraction process, NV3D offers two distinct sampling strategies: normal vector density-based sampling and FOV-aware bin-based sampling, allowing elimination of up to 55% of data while maintaining performance. In addition, we applied element-wise attention fusion, which accepts voxel features as the query and value and normal vector features as the key, similar to the attention mechanism. Our method is trained on the KITTI dataset and has demonstrated superior performance in car and cyclist detection owing to their spatial shapes. In the validation set, NV3D without sampling achieves 86.60% and 80.18% mean Average Precision (mAP), greater than the baseline Voxel R-CNN by 2.61% and 4.23% mAP, respectively. With both samplings, NV3D achieves 85.54% mAP in car detection, exceeding the baseline by 1.56% mAP, despite roughly 55% of voxels being filtered out.
☆ EvoCAD: Evolutionary CAD Code Generation with Vision Language Models
Combining large language models with evolutionary computation algorithms represents a promising research direction leveraging the remarkable generative and in-context learning capabilities of LLMs with the strengths of evolutionary algorithms. In this work, we present EvoCAD, a method for generating computer-aided design (CAD) objects through their symbolic representations using vision language models and evolutionary optimization. Our method samples multiple CAD objects, which are then optimized using an evolutionary approach with vision language and reasoning language models. We assess our method using GPT-4V and GPT-4o, evaluating it on the CADPrompt benchmark dataset and comparing it to prior methods. Additionally, we introduce two new metrics based on topological properties defined by the Euler characteristic, which capture a form of semantic similarity between 3D objects. Our results demonstrate that EvoCAD outperforms previous approaches on multiple metrics, particularly in generating topologically correct objects, which can be efficiently evaluated using our two novel metrics that complement existing spatial metrics.
comment: Accepted to IEEE ICTAI 2025
☆ High-resolution Photo Enhancement in Real-time: A Laplacian Pyramid Network
Photo enhancement plays a crucial role in augmenting the visual aesthetics of a photograph. In recent years, photo enhancement methods have either focused on enhancement performance, producing powerful models that cannot be deployed on edge devices, or prioritized computational efficiency, resulting in inadequate performance for real-world applications. To this end, this paper introduces a pyramid network called LLF-LUT++, which integrates global and local operators through closed-form Laplacian pyramid decomposition and reconstruction. This approach enables fast processing of high-resolution images while also achieving excellent performance. Specifically, we utilize an image-adaptive 3D LUT that capitalizes on the global tonal characteristics of downsampled images, while incorporating two distinct weight fusion strategies to achieve coarse global image enhancement. To implement this strategy, we designed a spatial-frequency transformer weight predictor that effectively extracts the desired distinct weights by leveraging frequency features. Additionally, we apply local Laplacian filters to adaptively refine edge details in high-frequency components. After meticulously redesigning the network structure and transformer model, LLF-LUT++ not only achieves a 2.64 dB improvement in PSNR on the HDR+ dataset, but also further reduces runtime, with 4K resolution images processed in just 13 ms on a single GPU. Extensive experimental results on two benchmark datasets further show that the proposed approach performs favorably compared to state-of-the-art methods. The source code will be made publicly available at https://github.com/fengzhang427/LLF-LUT.
comment: accepted by TPAMI 2025
☆ ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
comment: Data & Code: https://github.com/OpenGVLab/ExpVid
☆ ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training ICCV 2025
Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.
comment: ICCV 2025, Project page: https://nianticspatial.github.io/ace-g/
☆ MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions. (2) a Sentiment Intensity Guided (SIG) module using multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities. (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities, and incorporates the Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.
comment: Under Review
☆ Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping
Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured with overall accuracy (OA), average accuracy (AA), and F1-score. HyperSigma achieved an OA of 34.5% (+/- 1.8%), DOFA reached 62.6% (+/- 3.5%), and the SpectralEarth model achieved an OA of 93.5% (+/- 0.8%). A compact SpectralEarth variant trained from scratch achieved 91%, highlighting the importance of model architecture for strong generalization across geographic regions and sensor platforms. These results provide a systematic evaluation of foundation models for operational hyperspectral crop mapping and outline directions for future model development.
comment: Being reviewed for WHISPERS conference ( Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing )
☆ A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation
Synthetic datasets are widely used for training urban scene recognition models, but even highly realistic renderings show a noticeable gap to real imagery. This gap is particularly pronounced when adapting to a specific target domain, such as Cityscapes, where differences in architecture, vegetation, object appearance, and camera characteristics limit downstream performance. Closing this gap with more detailed 3D modelling would require expensive asset and scene design, defeating the purpose of low-cost labelled data. To address this, we present a new framework that adapts an off-the-shelf diffusion model to a target domain using only imperfect pseudo-labels. Once trained, it generates high-fidelity, target-aligned images from semantic maps of any synthetic dataset, including low-effort sources created in hours rather than months. The method filters suboptimal generations, rectifies image-label misalignments, and standardises semantics across datasets, transforming weak synthetic data into competitive real-domain training sets. Experiments on five synthetic datasets and two real target datasets show segmentation gains of up to +8.0%pt. mIoU over state-of-the-art translation methods, making rapidly constructed synthetic datasets as effective as high-effort, time-intensive synthetic datasets requiring extensive manual design. This work highlights a valuable collaborative paradigm where fast semantic prototyping, combined with generative models, enables scalable, high-quality training data creation for urban scene understanding.
☆ SCOOP'D: Learning Mixed-Liquid-Solid Scooping via Sim2Real Generative Policy
Scooping items with tools such as spoons and ladles is common in daily life, ranging from assistive feeding to retrieving items from environmental disaster sites. However, developing a general and autonomous robotic scooping policy is challenging since it requires reasoning about complex tool-object interactions. Furthermore, scooping often involves manipulating deformable objects, such as granular media or liquids, which is challenging due to their infinite-dimensional configuration spaces and complex dynamics. We propose a method, SCOOP'D, which uses simulation from OmniGibson (built on NVIDIA Omniverse) to collect scooping demonstrations using algorithmic procedures that rely on privileged state information. Then, we use generative policies via diffusion to imitate demonstrations from observational input. We directly apply the learned policy in diverse real-world scenarios, testing its performance on various item quantities, item characteristics, and container types. In zero-shot deployment, our method demonstrates promising results across 465 trials in diverse scenarios, including objects of different difficulty levels that we categorize as "Level 1" and "Level 2." SCOOP'D outperforms all baselines and ablations, suggesting that this is a promising approach to acquiring robotic scooping skills. Project page is at https://scoopdiff.github.io/.
comment: Project page is at https://scoopdiff.github.io/
☆ SNAP: Towards Segmenting Anything in Any Point Cloud
Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present \textbf{SNAP} (\textbf{S}egment a\textbf{N}ything in \textbf{A}ny \textbf{P}oint cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page is at, https://neu-vi.github.io/SNAP/
comment: Project Page, https://neu-vi.github.io/SNAP/
☆ How many samples to label for an application given a foundation model? Chest X-ray classification study
Chest X-ray classification is vital yet resource-intensive, typically demanding extensive annotated data for accurate diagnosis. Foundation models mitigate this reliance, but how many labeled samples are required remains unclear. We systematically evaluate the use of power-law fits to predict the training size necessary for specific ROC-AUC thresholds. Testing multiple pathologies and foundation models, we find XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than a ResNet-50 baseline. Importantly, learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus. Our results enable practitioners to minimize annotation costs by labeling only the essential samples for targeted performance.
comment: 8 pages, 5 figures
☆ ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
Omnidirectional images (ODIs) provide full 360x180 view which are widely adopted in VR, AR and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method which significantly enhances MLLMs' comprehension ability in the omnidirectional environment through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon the publication.
☆ Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal \emph{Massive Activations} (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We found that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of output. Building on these insights, we propose \textbf{D}etail \textbf{G}uidance (\textbf{DG}), a MAs-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded ``detail-deficient'' model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinements of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (\eg, SD3, SD3.5, and Flux).
☆ mmWalk: Towards Multi-modal Multi-view Walking Assistance NeurIPS 2025
Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and found they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
comment: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk
☆ LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To the end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning over four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
comment: 22 pages, 9 figures
☆ Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model NeurIPS 2025
Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.
comment: Accepted to NeurIPS 2025 Datasets and Benchmarks Track. Dataset and Code: https://github.com/RuipingL/Situat3DChange
☆ Towards Fast and Scalable Normal Integration using Continuous Components WACV
Surface normal integration is a fundamental problem in computer vision, dealing with the objective of reconstructing a surface from its corresponding normal map. Existing approaches require an iterative global optimization to jointly estimate the depth of each pixel, which scales poorly to larger normal maps. In this paper, we address this problem by recasting normal integration as the estimation of relative scales of continuous components. By constraining pixels belonging to the same component to jointly vary their scale, we drastically reduce the number of optimization variables. Our framework includes a heuristic to accurately estimate continuous components from the start, a strategy to rebalance optimization terms, and a technique to iteratively merge components to further reduce the size of the problem. Our method achieves state-of-the-art results on the standard normal integration benchmark in as little as a few seconds and achieves one-order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.
comment: Accepted by the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026, first round. 17 pages, 9 figures, 6 tables
☆ AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoR
comment: Tech report of OPPO AndesVL Team
☆ VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment NeurIPS 2025
3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.
comment: Accepted by NeurIPS 2025
☆ Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion
Existing Infrared and Visible Image Fusion (IVIF) methods typically assume high-quality inputs. However, when handing degraded images, these methods heavily rely on manually switching between different pre-processing techniques. This decoupling of degradation handling and image fusion leads to significant performance degradation. In this paper, we propose a novel VLM-Guided Degradation-Coupled Fusion network (VGDCFusion), which tightly couples degradation modeling with the fusion process and leverages vision-language models (VLMs) for degradation-aware perception and guided suppression. Specifically, the proposed Specific-Prompt Degradation-Coupled Extractor (SPDCE) enables modality-specific degradation awareness and establishes a joint modeling of degradation suppression and intra-modal feature extraction. In parallel, the Joint-Prompt Degradation-Coupled Fusion (JPDCF) facilitates cross-modal degradation perception and couples residual degradation filtering with complementary cross-modal feature fusion. Extensive experiments demonstrate that our VGDCFusion significantly outperforms existing state-of-the-art fusion approaches under various degraded image scenarios. Our code is available at https://github.com/Lmmh058/VGDCFusion.
☆ Enhancing Maritime Domain Awareness on Inland Waterways: A YOLO-Based Fusion of Satellite and AIS for Vessel Characterization
Maritime Domain Awareness (MDA) for inland waterways remains challenged by cooperative system vulnerabilities. This paper presents a novel framework that fuses high-resolution satellite imagery with vessel trajectory data from the Automatic Identification System (AIS). This work addresses the limitations of AIS-based monitoring by leveraging non-cooperative satellite imagery and implementing a fusion approach that links visual detections with AIS data to identify dark vessels, validate cooperative traffic, and support advanced MDA. The You Only Look Once (YOLO) v11 object detection model is used to detect and characterize vessels and barges by vessel type, barge cover, operational status, barge count, and direction of travel. An annotated data set of 4,550 instances was developed from $5{,}973~\mathrm{mi}^2$ of Lower Mississippi River imagery. Evaluation on a held-out test set demonstrated vessel classification (tugboat, crane barge, bulk carrier, cargo ship, and hopper barge) with an F1 score of 95.8\%; barge cover (covered or uncovered) detection yielded an F1 score of 91.6\%; operational status (staged or in motion) classification reached an F1 score of 99.4\%. Directionality (upstream, downstream) yielded 93.8\% accuracy. The barge count estimation resulted in a mean absolute error (MAE) of 2.4 barges. Spatial transferability analysis across geographically disjoint river segments showed accuracy was maintained as high as 98\%. These results underscore the viability of integrating non-cooperative satellite sensing with AIS fusion. This approach enables near-real-time fleet inventories, supports anomaly detection, and generates high-quality data for inland waterway surveillance. Future work will expand annotated datasets, incorporate temporal tracking, and explore multi-modal deep learning to further enhance operational scalability.
☆ Robust Ego-Exo Correspondence with Long-Term Memory NeurIPS 2025
Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module which consists of a dual-branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy to retain critical long-term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.
comment: Accepted by NeurIPS 2025
☆ DocReward: A Document Reward Model for Structuring and Stylizing
Recent advances in agentic workflows have enabled the automation of tasks such as professional document generation. However, they primarily focus on textual quality, neglecting visual structure and style, which are crucial for readability and engagement. This gap arises mainly from the absence of suitable reward models to guide agentic workflows toward producing documents with stronger structural and stylistic quality. To address this, we propose DocReward, a document reward model that evaluates documents based on their structure and style. We construct a multi-domain dataset DocPair of 117K paired documents, covering 32 domains and 267 document types, each including a high- and low-professionalism document with identical content but different structure and style. This enables the model to evaluate professionalism comprehensively, and in a textual-quality-agnostic way. DocReward is trained using the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking. To assess the performance of reward models, we create a test dataset containing document bundles ranked by well-educated human evaluators. Notably, DocReward outperforms GPT-4o and GPT-5 in accuracy by 30.6 and 19.4 percentage points, respectively, demonstrating its superiority over baselines. In an extrinsic evaluation of document generation, DocReward achieves a significantly higher win rate of 60.8%, compared to GPT-5's 37.7% win rate, demonstrating its utility in guiding generation agents toward producing human-preferred documents.
☆ MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference NeurIPS 2025
Modeling reflections from 2D images is essential for photorealistic rendering and novel view synthesis. Recent approaches enhance Gaussian primitives with reflection-related material attributes to enable physically based rendering (PBR) with Gaussian Splatting. However, the material inference often lacks sufficient constraints, especially under limited environment modeling, resulting in illumination aliasing and reduced generalization. In this work, we revisit the problem from a multi-view perspective and show that multi-view consistent material inference with more physically-based environment modeling is key to learning accurate reflections with Gaussian Splatting. To this end, we enforce 2D Gaussians to produce multi-view consistent material maps during deferred shading. We also track photometric variations across views to identify highly reflective regions, which serve as strong priors for reflection strength terms. To handle indirect illumination caused by inter-object occlusions, we further introduce an environment modeling strategy through ray tracing with 2DGS, enabling photorealistic rendering of indirect radiance. Experiments on widely used benchmarks show that our method faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel views synthesis.
comment: Accepted by NeurIPS 2025. Project Page: https://wen-yuan-zhang.github.io/MaterialRefGS
☆ Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored in current research. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in specific scenarios. Through extensive experiments, this paper verifies and elaborates that through RL training, MLLMs leverage their reasoning capability to convert redundant visual representations into compact, cross-domain aligned text representations. This conversion is precisely the source of the generalization exhibited by these reasoning-based IQA models. Building on this fundamental insight, we propose a novel algorithm, RALI, which employs contrastive learning to directly align images with these generalizable text representations learned by RL. This approach eliminates the reliance on reasoning processes and even obviates the need to load an LLM. For the quality scoring task, this framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.
☆ Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation ICCV
Generative Models are a valuable tool for the controlled creation of high-quality image data. Controlled diffusion models like the ControlNet have allowed the creation of labeled distributions. Such synthetic datasets can augment the original training distribution when discriminative models, like semantic segmentation, are trained. However, this augmentation effect is limited since ControlNets tend to reproduce the original training distribution. This work introduces a method to utilize data from unlabeled domains to train ControlNets by introducing the concept of uncertainty into the control mechanism. The uncertainty indicates that a given image was not part of the training distribution of a downstream task, e.g., segmentation. Thus, two types of control are engaged in the final network: an uncertainty control from an unlabeled dataset and a semantic control from the labeled dataset. The resulting ControlNet allows us to create annotated data with high uncertainty from the target domain, i.e., synthetic data from the unlabeled distribution with labels. In our scenario, we consider retinal OCTs, where typically high-quality Spectralis images are available with given ground truth segmentations, enabling the training of segmentation networks. The recent development in Home-OCT devices, however, yields retinal OCTs with lower quality and a large domain shift, such that out-of-the-pocket segmentation networks cannot be applied for this type of data. Synthesizing annotated images from the Home-OCT domain using the proposed approach closes this gap and leads to significantly improved segmentation results without adding any further supervision. The advantage of uncertainty-guidance becomes obvious when compared to style transfer: it enables arbitrary domain shifts without any strict learning of an image style. This is also demonstrated in a traffic scene experiment.
comment: Accepted for presentation at ICCV Workshops 2025, "The 4th Workshop on What is Next in Multimodal Foundation Models?" (MMFM)
☆ MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression
Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).
comment: Accepted for presentation at the 2025 Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025)
☆ InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
☆ REACT3D: Recovering Articulations for Interactive Physical 3D Scenes
Interactive 3D scenes are increasingly vital for embodied intelligence, yet existing datasets remain limited due to the labor-intensive process of annotating part segmentation, kinematic types, and motion trajectories. We present REACT3D, a scalable zero-shot framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry, enabling direct use in diverse downstream tasks. Our contributions include: (i) openable-object detection and segmentation to extract candidate movable parts from static scenes, (ii) articulation estimation that infers joint types and motion parameters, (iii) hidden-geometry completion followed by interactive object assembly, and (iv) interactive scene integration in widely supported formats to ensure compatibility with standard simulation platforms. We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes, demonstrating the effectiveness of our framework and providing a practical foundation for scalable interactive scene generation, thereby lowering the barrier to large-scale research on articulated scene understanding. Our project page is \textit{\hypersetup{urlcolor=black}\href{https://react3d.github.io/}{react3d.github.io}}.
comment: 8 pages
☆ Evaluating the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation
Flood mapping and water depth estimation from Synthetic Aperture Radar (SAR) imagery are crucial for calibrating and validating hydraulic models. This study uses SAR imagery to evaluate various preprocessing (especially speckle noise reduction), flood mapping, and water depth estimation methods. The impact of the choice of method at different steps and its hyperparameters is studied by considering an ensemble of preprocessed images, flood maps, and water depth fields. The evaluation is conducted for two flood events on the Garonne River (France) in 2019 and 2021, using hydrodynamic simulations and in-situ observations as reference data. Results show that the choice of speckle filter alters flood extent estimations with variations of several square kilometers. Furthermore, the selection and tuning of flood mapping methods also affect performance. While supervised methods outperformed unsupervised ones, tuned unsupervised approaches (such as local thresholding or change detection) can achieve comparable results. The compounded uncertainty from preprocessing and flood mapping steps also introduces high variability in the water depth field estimates. This study highlights the importance of considering the entire processing pipeline, encompassing preprocessing, flood mapping, and water depth estimation methods and their associated hyperparameters. Rather than relying on a single configuration, adopting an ensemble approach and accounting for methodological uncertainty should be privileged. For flood mapping, the method choice has the most influence. For water depth estimation, the most influential processing step was the flood map input resulting from the flood mapping step and the hyperparameters of the methods.
☆ sketch2symm: Symmetry-aware sketch-to-shape generation via semantic bridging
Sketch-based 3D reconstruction remains a challenging task due to the abstract and sparse nature of sketch inputs, which often lack sufficient semantic and geometric information. To address this, we propose Sketch2Symm, a two-stage generation method that produces geometrically consistent 3D shapes from sketches. Our approach introduces semantic bridging via sketch-to-image translation to enrich sparse sketch representations, and incorporates symmetry constraints as geometric priors to leverage the structural regularity commonly found in everyday objects. Experiments on mainstream sketch datasets demonstrate that our method achieves superior performance compared to existing sketch-based reconstruction methods in terms of Chamfer Distance, Earth Mover's Distance, and F-Score, verifying the effectiveness of the proposed semantic bridging and symmetry-aware design.
☆ When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Object detection systems have traditionally relied on supervised learning with manually annotated bounding boxes, achieving high accuracy at the cost of substantial annotation investment. The emergence of Vision-Language Models (VLMs) offers an alternative paradigm enabling zero-shot detection through natural language queries, eliminating annotation requirements but operating with reduced accuracy. This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection (YOLO) with zero-shot VLM inference (Gemini Flash 2.5). Through systematic evaluation on 1,000 stratified COCO images and 200 diverse product images spanning consumer electronics and rare categories, combined with detailed Total Cost of Ownership modeling, we establish quantitative break-even thresholds governing architecture selection. Our findings reveal that supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories, representing a 22.7 percentage point advantage that costs $10,800 in annotation for 100-category systems. However, this advantage justifies investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year. Zero-shot Gemini demonstrates 52.3% accuracy on diverse product categories (ranging from highly web-prevalent consumer electronics at 75-85% to rare specialized equipment at 25-40%) where supervised YOLO achieves 0% due to architectural constraints preventing detection of untrained classes. Cost per Correct Detection analysis reveals substantially lower per-detection costs for Gemini ($0.00050 vs $0.143) at 100,000 inferences despite accuracy deficits. We develop decision frameworks demonstrating that optimal architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics.
comment: 23 pages, 4 figures, 4 tables
☆ $Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization
Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named {\Delta}Energy. {\Delta}Energy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, {\Delta}Energy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for {\Delta}Energy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yields a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
comment: Accepted by NeruIPS2025
☆ Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) -- variation in human confidence across annotations -- but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5\% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
☆ EEMS: Edge-Prompt Enhanced Medical Image Segmentation Based on Learnable Gating Mechanism
Medical image segmentation is vital for diagnosis, treatment planning, and disease monitoring but is challenged by complex factors like ambiguous edges and background noise. We introduce EEMS, a new model for segmentation, combining an Edge-Aware Enhancement Unit (EAEU) and a Multi-scale Prompt Generation Unit (MSPGU). EAEU enhances edge perception via multi-frequency feature extraction, accurately defining boundaries. MSPGU integrates high-level semantic and low-level spatial features using a prompt-guided approach, ensuring precise target localization. The Dual-Source Adaptive Gated Fusion Unit (DAGFU) merges edge features from EAEU with semantic features from MSPGU, enhancing segmentation accuracy and robustness. Tests on datasets like ISIC2018 confirm EEMS's superior performance and reliability as a clinical tool.
comment: Accepted by BIBM 2025
☆ Exploring and Leveraging Class Vectors for Classifier Editing NeurIPS 2025
Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce Class Vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, Class Vectors disentangle each class's adaptation in the latent space. We show that Class Vectors capture each class's semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of Class Vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.
comment: Accepted in NeurIPS 2025
☆ A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images
Microscopic characterizations, such as Scanning Electron Microscopy (SEM), are widely used in scientific research for visualizing and analyzing microstructures. Determining the scale bars is an important first step of accurate SEM analysis; however, currently, it mainly relies on manual operations, which is both time-consuming and prone to errors. To address this issue, we propose a multi-modal and automated scale bar detection and extraction framework that provides concurrent object detection, text detection and text recognition with a Large Language Model (LLM) agent. The proposed framework operates in four phases; i) Automatic Dataset Generation (Auto-DG) model to synthesize a diverse dataset of SEM images ensuring robust training and high generalizability of the model, ii) scale bar object detection, iii) information extraction using a hybrid Optical Character Recognition (OCR) system with DenseNet and Convolutional Recurrent Neural Network (CRNN) based algorithms, iv) an LLM agent to analyze and verify accuracy of the results. The proposed model demonstrates a strong performance in object detection and accurate localization with a precision of 100%, recall of 95.8%, and a mean Average Precision (mAP) of 99.2% at IoU=0.5 and 69.1% at IoU=0.5:0.95. The hybrid OCR system achieved 89% precision, 65% recall, and a 75% F1 score on the Auto-DG dataset, significantly outperforming several mainstream standalone engines, highlighting its reliability for scientific image analysis. The LLM is introduced as a reasoning engine as well as an intelligent assistant that suggests follow-up steps and verifies the results. This automated method powered by an LLM agent significantly enhances the efficiency and accuracy of scale bar detection and extraction in SEM images, providing a valuable tool for microscopic analysis and advancing the field of scientific imaging.
comment: 14 pages, 6 figures
☆ DTEA: Dynamic Topology Weaving and Instability-Driven Entropic Attenuation for Medical Image Segmentation
In medical image segmentation, skip connections are used to merge global context and reduce the semantic gap between encoder and decoder. Current methods often struggle with limited structural representation and insufficient contextual modeling, affecting generalization in complex clinical scenarios. We propose the DTEA model, featuring a new skip connection framework with the Semantic Topology Reconfiguration (STR) and Entropic Perturbation Gating (EPG) modules. STR reorganizes multi-scale semantic features into a dynamic hypergraph to better model cross-resolution anatomical dependencies, enhancing structural and semantic representation. EPG assesses channel stability after perturbation and filters high-entropy channels to emphasize clinically important regions and improve spatial attention. Extensive experiments on three benchmark datasets show our framework achieves superior segmentation accuracy and better generalization across various clinical settings. The code is available at \href{https://github.com/LWX-Research/DTEA}{https://github.com/LWX-Research/DTEA}.
comment: Accepted by BIBM 2025
☆ Nepali Sign Language Characters Recognition: Dataset Development and Deep Learning Approaches
Sign languages serve as essential communication systems for individuals with hearing and speech impairments. However, digital linguistic dataset resources for underrepresented sign languages, such as Nepali Sign Language (NSL), remain scarce. This study introduces the first benchmark dataset for NSL, consisting of 36 gesture classes with 1,500 samples per class, designed to capture the structural and visual features of the language. To evaluate recognition performance, we fine-tuned MobileNetV2 and ResNet50 architectures on the dataset, achieving classification accuracies of 90.45% and 88.78%, respectively. These findings demonstrate the effectiveness of convolutional neural networks in sign recognition tasks, particularly within low-resource settings. To the best of our knowledge, this work represents the first systematic effort to construct a benchmark dataset and assess deep learning approaches for NSL recognition, highlighting the potential of transfer learning and fine-tuning for advancing research in underexplored sign languages.
comment: 6 pages, 9 figures
LightPneumoNet: Lightweight Pneumonia Classifier
Effective pneumonia diagnosis is often challenged by the difficulty of deploying large, computationally expensive deep learning models in resource-limited settings. This study introduces LightPneumoNet, an efficient, lightweight convolutional neural network (CNN) built from scratch to provide an accessible and accurate diagnostic solution for pneumonia detection from chest X-rays. Our model was trained on a public dataset of 5,856 chest X-ray images. Preprocessing included image resizing to 224x224, grayscale conversion, and pixel normalization, with data augmentation (rotation, zoom, shear) to prevent overfitting. The custom architecture features four blocks of stacked convolutional layers and contains only 388,082 trainable parameters, resulting in a minimal 1.48 MB memory footprint. On the independent test set, our model delivered exceptional performance, achieving an overall accuracy of 0.942, precision of 0.92, and an F1-Score of 0.96. Critically, it obtained a sensitivity (recall) of 0.99, demonstrating a near-perfect ability to identify true pneumonia cases and minimize clinically significant false negatives. Notably, LightPneumoNet achieves this high recall on the same dataset where existing approaches typically require significantly heavier architectures or fail to reach comparable sensitivity levels. The model's efficiency enables deployment on low-cost hardware, making advanced computer-aided diagnosis accessible in underserved clinics and serving as a reliable second-opinion tool to improve patient outcomes.
comment: 13 pages (including references), 5 figures
☆ Investigating Identity Signals in Conversational Facial Dynamics via Disentangled Expression Features
This work investigates whether individuals can be identified solely through the pure dynamical components of their facial expressions, independent of static facial appearance. We leverage the FLAME 3D morphable model to achieve explicit disentanglement between facial shape and expression dynamics, extracting frame-by-frame parameters from conversational videos while retaining only expression and jaw coefficients. On the CANDOR dataset of 1,429 speakers in naturalistic conversations, our Conformer model with supervised contrastive learning achieves 61.14\%accuracy on 1,429-way classification -- 458 times above chance -- demonstrating that facial dynamics carry strong identity signatures. We introduce a drift-to-noise ratio (DNR) that quantifies the reliability of shape expression separation by measuring across-session shape changes relative to within-session variability. DNR strongly negatively correlates with recognition performance, confirming that unstable shape estimation compromises dynamic identification. Our findings reveal person-specific signatures in conversational facial dynamics, with implications for social perception and clinical assessment.
☆ Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos CVPR 2023
The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include `letter names', `letter sounds', and math codes include `counting', `sorting'. We pose this as a fine-grained multilabel classification problem as videos can contain multiple types of educational content and the content classes can get visually similar (e.g., `letter names' vs `letter sounds'). We propose a novel class prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues are crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as Youtube-8M, and COIN. The dataset is available at https://github.com/rohit-gupta/MMContrast/tree/main/APPROVE
comment: Published at CVPR 2023
☆ Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
☆ FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models NeurIPS 2025
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model's associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
comment: 19 pages, 11 figures. Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Saudi Sign Language Translation Using T5
This paper explores the application of T5 models for Saudi Sign Language (SSL) translation using a novel dataset. The SSL dataset includes three challenging testing protocols, enabling comprehensive evaluation across different scenarios. Additionally, it captures unique SSL characteristics, such as face coverings, which pose challenges for sign recognition and translation. In our experiments, we investigate the impact of pre-training on American Sign Language (ASL) data by comparing T5 models pre-trained on the YouTubeASL dataset with models trained directly on the SSL dataset. Experimental results demonstrate that pre-training on YouTubeASL significantly improves models' performance (roughly $3\times$ in BLEU-4), indicating cross-linguistic transferability in sign language models. Our findings highlight the benefits of leveraging large-scale ASL data to improve SSL translation and provide insights into the development of more effective sign language translation systems. Our code is publicly available at our GitHub repository.
comment: 11 pages, supplementary, SPECOM 2025
☆ Generalisation of automatic tumour segmentation in histopathological whole-slide images across multiple cancer types
Deep learning is expected to aid pathologists by automating tasks such as tumour segmentation. We aimed to develop one universal tumour segmentation model for histopathological images and examine its performance in different cancer types. The model was developed using over 20 000 whole-slide images from over 4 000 patients with colorectal, endometrial, lung, or prostate carcinoma. Performance was validated in pre-planned analyses on external cohorts with over 3 000 patients across six cancer types. Exploratory analyses included over 1 500 additional patients from The Cancer Genome Atlas. Average Dice coefficient was over 80% in all validation cohorts with en bloc resection specimens and in The Cancer Genome Atlas cohorts. No loss of performance was observed when comparing the universal model with models specialised on single cancer types. In conclusion, extensive and rigorous evaluations demonstrate that generic tumour segmentation by a single model is possible across cancer types, patient populations, sample preparations, and slide scanners.
☆ BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.
comment: Code and Dataset to be released
☆ G2L:From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Knowledge Distillation
Recent studies in pathology foundation models have shown that scaling training data, diversifying cancer types, and increasing model size consistently improve their performance. However, giga-scale foundation models, which are trained on hundreds of thousands of slides covering tens of cancer types and contain billions of parameters, pose significant challenges for practical use due to their tremendous computational costs in both development and deployment. In this work, we present a novel strategy, named the G2L framework, to increase the performance of large-scale foundation models, which consist of only $15\%$ of the parameters of giga-scale models, to a comparable performance level of giga-scale models in cancer-specific tasks. Our approach applies knowledge distillation, transferring the capabilities of a giga-scale model to a large-scale model, using just 1K pathology slides of a target cancer (e.g., breast, prostate, etc.). The resulting distilled model not only outperformed state-of-the-art models of the same size (i.e., large-scale) across several benchmarks but also, interestingly, surpassed the giga-scale teacher and huge-scale models in some benchmarks. In addition, the distilled model exhibited a higher robustness index, indicating improved resilience to image variations originating from multiple institutions. These findings suggest that the proposed distillation approach for a large-scale model is a data- and parameter-efficient way to achieve giga-scale-level performance for cancer-specific applications without prohibitive computational burden.
☆ Reliable Cross-modal Alignment via Prototype Iterative Construction
Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable fundamention for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. These non-semantic information primarily manifest as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2\%-14.1\%.
☆ CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above prior state of the art across both validation and test partitions. Extensive experiments reveal that the quality of the heatmap strongly influences the resulting mask quality, supporting a consistent association between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and predicting masks more precisely. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.
comment: 18 pages, 6 figures, 6 tables
☆ Multiview Manifold Evidential Fusion for PolSAR Image Classification
Polarimetric Synthetic Aperture Radar (PolSAR) covariance matrices and their extracted multi-features - such as scattering angle, entropy, texture, and boundary descriptors - provide complementary and physically interpretable information for image classification. Traditional fusion strategies typically concatenate these features or employ deep learning networks to combine them. However, the covariance matrices and multi-features, as two complementary views, lie on different manifolds with distinct geometric structures. Existing fusion methods also overlook the varying importance of different views and ignore uncertainty, often leading to unreliable predictions. To address these issues, we propose a Multiview Manifold Evidential Fusion (MMEFnet) method to effectively fuse these two views. It gives a new framework to integrate PolSAR manifold learning and evidence fusion into a unified architecture. Specifically, covariance matrices are represented on the Hermitian Positive Definite (HPD) manifold, while multi-features are modeled on the Grassmann manifold. Two different kernel metric learning networks are constructed to learn their manifold representations. Subsequently, a trusted multiview evidence fusion, replacing the conventional softmax classifier, estimates belief mass and quantifies the uncertainty of each view from the learned deep features. Finally, a Dempster-Shafer theory-based fusion strategy combines evidence, enabling a more reliable and interpretable classification. Extensive experiments on three real-world PolSAR datasets demonstrate that the proposed method consistently outperforms existing approaches in accuracy, robustness, and interpretability.
comment: The paper has 14 pages and 7 figures
☆ Validation of an Artificial Intelligence Tool for the Detection of Sperm DNA Fragmentation Using the TUNEL In Situ Hybridization Assay
Sperm DNA fragmentation (SDF) is a critical parameter in male fertility assessment that conventional semen analysis fails to evaluate. This study presents the validation of a novel artificial intelligence (AI) tool designed to detect SDF through digital analysis of phase contrast microscopy images, using the terminal deoxynucleotidyl transferase dUTP nick end labeling (TUNEL) assay as the gold standard reference. Utilising the established link between sperm morphology and DNA integrity, the present work proposes a morphology assisted ensemble AI model that combines image processing techniques with state-of-the-art transformer based machine learning models (GC-ViT) for the prediction of DNA fragmentation in sperm from phase contrast images. The ensemble model is benchmarked against a pure transformer `vision' model as well as a `morphology-only` model. Promising results show the proposed framework is able to achieve sensitivity of 60\% and specificity of 75\%. This non-destructive methodology represents a significant advancement in reproductive medicine by enabling real-time sperm selection based on DNA integrity for clinical diagnostic and therapeutic applications.
☆ video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory
Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline, fixed-frame-number methods require the stream length to adapt frame rates; streaming methods constrain memory by merging or discarding tokens, losing information. We propose video-SALMONN S, a streaming audio-visual LLM that, to our knowledge, is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. Our model introduces (i) a test-time-training (TTT) memory module that continually updates token representations to capture long-range dependencies by replacing token merging, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT_HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens. Our 8B-parameter model achieves 74.2% overall and 67.8% on the Video-MME long split, outperforming both offline and streaming baselines.
Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer
Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student's learned representations by feeding them back into the frozen teacher's prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.
☆ Demystifying Numerosity in Diffusion Models -- Limitations and Remedies
Numerosity remains a challenge for state-of-the-art text-to-image generation models like FLUX and GPT-4o, which often fail to accurately follow counting instructions in text prompts. In this paper, we aim to study a fundamental yet often overlooked question: Can diffusion models inherently generate the correct number of objects specified by a textual prompt simply by scaling up the dataset and model size? To enable rigorous and reproducible evaluation, we construct a clean synthetic numerosity benchmark comprising two complementary datasets: GrayCount250 for controlled scaling studies, and NaturalCount6 featuring complex naturalistic scenes. Second, we empirically show that the scaling hypothesis does not hold: larger models and datasets alone fail to improve counting accuracy on our benchmark. Our analysis identifies a key reason: diffusion models tend to rely heavily on the noise initialization rather than the explicit numerosity specified in the prompt. We observe that noise priors exhibit biases toward specific object counts. In addition, we propose an effective strategy for controlling numerosity by injecting count-aware layout information into the noise prior. Our method achieves significant gains, improving accuracy on GrayCount250 from 20.0\% to 85.3\% and on NaturalCount6 from 74.8\% to 86.3\%, demonstrating effective generalization across settings.
☆ Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning
Few-shot learning (FSL) addresses the challenge of classifying novel classes with limited training samples. While some methods leverage semantic knowledge from smaller-scale models to mitigate data scarcity, these approaches often introduce noise and bias due to the data's inherent simplicity. In this paper, we propose a novel framework, Synergistic Knowledge Transfer (SynTrans), which effectively transfers diverse and complementary knowledge from large multimodal models to empower the off-the-shelf few-shot learner. Specifically, SynTrans employs CLIP as a robust teacher and uses a few-shot vision encoder as a weak student, distilling semantic-aligned visual knowledge via an unsupervised proxy task. Subsequently, a training-free synergistic knowledge mining module facilitates collaboration among large multimodal models to extract high-quality semantic knowledge. Building upon this, a visual-semantic bridging module enables bi-directional knowledge transfer between visual and semantic spaces, transforming explicit visual and implicit semantic knowledge into category-specific classifier weights. Finally, SynTrans introduces a visual weight generator and a semantic weight reconstructor to adaptively construct optimal multimodal FSL classifiers. Experimental results on four FSL datasets demonstrate that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.
comment: Accepted by IJCAI 2025
☆ Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment NeurIPS 2025
Longitudinal multimodal data, including electronic health records (EHR) and sequential chest X-rays (CXRs), is critical for modeling disease progression, yet remains underutilized due to two key challenges: (1) redundancy in consecutive CXR sequences, where static anatomical regions dominate over clinically-meaningful dynamics, and (2) temporal misalignment between sparse, irregular imaging and continuous EHR data. We introduce $\texttt{DiPro}$, a novel framework that addresses these challenges through region-aware disentanglement and multi-timescale alignment. First, we disentangle static (anatomy) and dynamic (pathology progression) features in sequential CXRs, prioritizing disease-relevant changes. Second, we hierarchically align these static and dynamic CXR features with asynchronous EHR data via local (pairwise interval-level) and global (full-sequence) synchronization to model coherent progression pathways. Extensive experiments on the MIMIC dataset demonstrate that $\texttt{DiPro}$ could effectively extract temporal clinical dynamics and achieve state-of-the-art performance on both disease progression identification and general ICU prediction tasks.
comment: NeurIPS 2025 Spotlight
☆ MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps ICCV 2025
This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.
comment: Accepted at ICCV 2025, project page: https://jiahuilei.com/projects/momap/
☆ Compositional Zero-Shot Learning: A Survey
Compositional Zero-Shot Learning (CZSL) is a critical task in computer vision that enables models to recognize unseen combinations of known attributes and objects during inference, addressing the combinatorial challenge of requiring training data for every possible composition. This is particularly challenging because the visual appearance of primitives is highly contextual; for example, ``small'' cats appear visually distinct from ``older'' ones, and ``wet'' cars differ significantly from ``wet'' cats. Effectively modeling this contextuality and the inherent compositionality is crucial for robust compositional zero-shot recognition. This paper presents, to our knowledge, the first comprehensive survey specifically focused on Compositional Zero-Shot Learning. We systematically review the state-of-the-art CZSL methods, introducing a taxonomy grounded in disentanglement, with four families of approaches: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. We provide a detailed comparative analysis of these methods, highlighting their core advantages and limitations in different problem settings, such as closed-world and open-world CZSL. Finally, we identify the most significant open challenges and outline promising future research directions. This survey aims to serve as a foundational resource to guide and inspire further advancements in this fascinating and important field. Papers studied in this survey with their official code are available on our github: https://github.com/ans92/Compositional-Zero-Shot-Learning
comment: Survey paper with 36 pages, 8 plots and 4 figures
☆ CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
Multimodal Large Language Models (MLLMs) have achieved remarkable success in tasks such as image captioning, visual question answering, and cross-modal reasoning by integrating visual and textual modalities. However, their multimodal nature also exposes them to adversarial threats, where attackers can perturb either modality or both jointly to induce harmful, misleading, or policy violating outputs. Existing defense strategies, such as adversarial training and input purification, face notable limitations: adversarial training typically improves robustness only against known attacks while incurring high computational costs, whereas conventional purification approaches often suffer from degraded image quality and insufficient generalization to complex multimodal tasks. In this work, we focus on defending the visual modality, which frequently serves as the primary entry point for adversarial manipulation. We propose a supervised diffusion based denoising framework that leverages paired adversarial clean image datasets to fine-tune diffusion models with directional, task specific guidance. Unlike prior unsupervised purification methods such as DiffPure, our approach achieves higher quality reconstructions while significantly improving defense robustness in multimodal tasks. Furthermore, we incorporate prompt optimization as a complementary defense mechanism, enhancing resistance against diverse and unseen attack strategies. Extensive experiments on image captioning and visual question answering demonstrate that our method not only substantially improves robustness but also exhibits strong transferability to unknown adversarial attacks. These results highlight the effectiveness of supervised diffusion based denoising for multimodal defense, paving the way for more reliable and secure deployment of MLLMs in real world applications.
☆ Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution NeurIPS 2025
End-to-end autonomous driving methods aim to directly map raw sensor inputs to future driving actions such as planned trajectories, bypassing traditional modular pipelines. While these approaches have shown promise, they often operate under a one-shot paradigm that relies heavily on the current scene context, potentially underestimating the importance of scene dynamics and their temporal evolution. This limitation restricts the model's ability to make informed and adaptive decisions in complex driving scenarios. We propose a new perspective: the future trajectory of an autonomous vehicle is closely intertwined with the evolving dynamics of its environment, and conversely, the vehicle's own future states can influence how the surrounding scene unfolds. Motivated by this bidirectional relationship, we introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner. Our method first predicts future bird's-eye view (BEV) representations to anticipate the dynamics of the surrounding scene, then leverages this foresight to generate future-context-aware trajectories. Two key components enable this: (1) future-aware planning, which injects predicted BEV features into the trajectory planner, and (2) iterative scene modeling and vehicle planning, which refines both future scene prediction and trajectory generation through collaborative optimization. Extensive experiments on the NAVSIM and nuScenes benchmarks show that SeerDrive significantly outperforms existing state-of-the-art methods.
comment: NeurIPS 2025
☆ Text-Enhanced Panoptic Symbol Spotting in CAD Drawings
With the widespread adoption of Computer-Aided Design(CAD) drawings in engineering, architecture, and industrial design, the ability to accurately interpret and analyze these drawings has become increasingly critical. Among various subtasks, panoptic symbol spotting plays a vital role in enabling downstream applications such as CAD automation and design retrieval. Existing methods primarily focus on geometric primitives within the CAD drawings to address this task, but they face following major problems: they usually overlook the rich textual annotations present in CAD drawings and they lack explicit modeling of relationships among primitives, resulting in incomprehensive understanding of the holistic drawings. To fill this gap, we propose a panoptic symbol spotting framework that incorporates textual annotations. The framework constructs unified representations by jointly modeling geometric and textual primitives. Then, using visual features extract by pretrained CNN as the initial representations, a Transformer-based backbone is employed, enhanced with a type-aware attention mechanism to explicitly model the different types of spatial dependencies between various primitives. Extensive experiments on the real-world dataset demonstrate that the proposed method outperforms existing approaches on symbol spotting tasks involving textual annotations, and exhibits superior robustness when applied to complex CAD drawings.
comment: 7 pages, 3figures. This version is the original submitted manuscript of the paper accepted by The 12th International Conference on Behavioural and Social Computing
☆ Source-Free Object Detection with Detection Transformer
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs. FRANCK comprises four key components: (1) an Objectness Score-based Sample Reweighting (OSSR) module that computes attention-based objectness scores on multi-scale encoder feature maps, reweighting the detection loss to emphasize less-recognized regions; (2) a Contrastive Learning with Matching-based Memory Bank (CMMB) module that integrates multi-level features into memory banks, enhancing class-wise contrastive learning; (3) an Uncertainty-weighted Query-fused Feature Distillation (UQFD) module that improves feature distillation through prediction quality reweighting and query feature fusion; and (4) an improved self-training pipeline with a Dynamic Teacher Updating Interval (DTUI) that optimizes pseudo-label quality. By leveraging these components, FRANCK effectively adapts a source-pre-trained DETR model to a target domain with enhanced robustness and generalization. Extensive experiments on several widely used benchmarks demonstrate that our method achieves state-of-the-art performance, highlighting its effectiveness and compatibility with DETR-based SFOD models.
comment: IEEE Transactions on Image Processing
☆ ROFI: A Deep Learning-Based Ophthalmic Sign-Preserving and Reversible Patient Face Anonymizer
Patient face images provide a convenient mean for evaluating eye diseases, while also raising privacy concerns. Here, we introduce ROFI, a deep learning-based privacy protection framework for ophthalmology. Using weakly supervised learning and neural identity translation, ROFI anonymizes facial features while retaining disease features (over 98\% accuracy, $\kappa > 0.90$). It achieves 100\% diagnostic sensitivity and high agreement ($\kappa > 0.90$) across eleven eye diseases in three cohorts, anonymizing over 95\% of images. ROFI works with AI systems, maintaining original diagnoses ($\kappa > 0.80$), and supports secure image reversal (over 98\% similarity), enabling audits and long-term care. These results show ROFI's effectiveness of protecting patient privacy in the digital medicine era.
comment: Accepted to Nature NPJ Digital Medicine
☆ LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation
This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge held in conjunction with ICCV 2025. Besides the two traditional tracks of LSVOS that jointly target robustness in realistic video scenarios: Classic VOS (VOS), and Referring VOS (RVOS), the 2025 edition features a newly introduced track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty, introducing more challenging but realistic scenarios including denser small objects, frequent disappear/reappear events, severe occlusions, adverse weather and lighting, etc., pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains standard ${J}$, $F$, and ${J\&F}$ metrics for VOS and RVOS, while MOSEv2 adopts ${J\&\dot{F}}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.
comment: 16 pages, 9 figures
☆ Zero-shot Face Editing via ID-Attribute Decoupled Inversion ICME2025
Recent advancements in text-guided diffusion models have shown promise for general image editing via inversion techniques, but often struggle to maintain ID and structural consistency in real face editing tasks. To address this limitation, we propose a zero-shot face editing method based on ID-Attribute Decoupled Inversion. Specifically, we decompose the face representation into ID and attribute features, using them as joint conditions to guide both the inversion and the reverse diffusion processes. This allows independent control over ID and attributes, ensuring strong ID preservation and structural consistency while enabling precise facial attribute manipulation. Our method supports a wide range of complex multi-attribute face editing tasks using only text prompts, without requiring region-specific input, and operates at a speed comparable to DDIM inversion. Comprehensive experiments demonstrate its practicality and effectiveness.
comment: Accepted by ICME2025
☆ Benchmarking Deep Learning Models for Laryngeal Cancer Staging Using the LaryngealCT Dataset
Laryngeal cancer imaging research lacks standardised datasets to enable reproducible deep learning (DL) model development. We present LaryngealCT, a curated benchmark of 1,029 computed tomography (CT) scans aggregated from six collections from The Cancer Imaging Archive (TCIA). Uniform 1 mm isotropic volumes of interest encompassing the larynx were extracted using a weakly supervised parameter search framework validated by clinical experts. 3D DL architectures (3D CNN, ResNet18,50,101, DenseNet121) were benchmarked on (i) early (Tis,T1,T2) vs. advanced (T3,T4) and (ii) T4 vs. non-T4 classification tasks. 3D CNN (AUC-0.881, F1-macro-0.821) and ResNet18 (AUC-0.892, F1-macro-0.646) respectively outperformed the other models in the two tasks. Model explainability assessed using 3D GradCAMs with thyroid cartilage overlays revealed greater peri-cartilage attention in non-T4 cases and focal activations in T4 predictions. Through open-source data, pretrained models, and integrated explainability tools, LaryngealCT offers a reproducible foundation for AI-driven research to support clinical decisions in laryngeal oncology.
☆ Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts
Recently, the powerful generalization ability exhibited by foundation models has brought forth new solutions for zero-shot anomaly segmentation tasks. However, guiding these foundation models correctly to address downstream tasks remains a challenge. This paper proposes a novel two-stage framework, for zero-shot anomaly segmentation tasks in industrial anomaly detection. This framework excellently leverages the powerful anomaly localization capability of CLIP and the boundary perception ability of SAM.(1) To mitigate SAM's inclination towards object segmentation, we propose the Co-Feature Point Prompt Generation (PPG) module. This module collaboratively utilizes CLIP and SAM to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than the entire object. (2) To further optimize SAM's segmentation results and mitigate rough boundaries and isolated noise, we introduce the Cascaded Prompts for SAM (CPS) module. This module employs hybrid prompts cascaded with a lightweight decoder of SAM, achieving precise segmentation of anomalous regions. Across multiple datasets, consistent experimental validation demonstrates that our approach achieves state-of-the-art zero-shot anomaly segmentation results. Particularly noteworthy is our performance on the Visa dataset, where we outperform the state-of-the-art methods by 10.3\% and 7.7\% in terms of {$F_1$-max} and AP metrics, respectively.
comment: Accepted by PRCV
☆ Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks - including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
☆ GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce \textbf{GIR-Bench}, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design different task-specific evaluation pipelines tailored for each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems have shown that: Although unified models are more capable of reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at \href{https://hkust-longgroup.github.io/GIR-Bench}{https://hkust-longgroup.github.io/GIR-Bench}.
☆ GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation
Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.
comment: 22 pages
☆ The Easy Path to Robustness: Coreset Selection using Sample Hardness
Designing adversarially robust models from a data-centric perspective requires understanding which input samples are most crucial for learning resilient features. While coreset selection provides a mechanism for efficient training on data subsets, current algorithms are designed for clean accuracy and fall short in preserving robustness. To address this, we propose a framework linking a sample's adversarial vulnerability to its \textit{hardness}, which we quantify using the average input gradient norm (AIGN) over training. We demonstrate that \textit{easy} samples (with low AIGN) are less vulnerable and occupy regions further from the decision boundary. Leveraging this insight, we present EasyCore, a coreset selection algorithm that retains only the samples with low AIGN for training. We empirically show that models trained on EasyCore-selected data achieve significantly higher adversarial accuracy than those trained with competing coreset methods under both standard and adversarial training. As AIGN is a model-agnostic dataset property, EasyCore is an efficient and widely applicable data-centric method for improving adversarial robustness. We show that EasyCore achieves up to 7\% and 5\% improvement in adversarial accuracy under standard training and TRADES adversarial training, respectively, compared to existing coreset methods.
☆ High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation ICCV 2025
Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently have difficulties in balancing global and local dynamic modeling and may bias the network to one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, limiting their applicability especially for high-resolution sequences. Recently, the state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba from two aspects to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a windowed space-time scan-based Local Refinement Mamba to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.
comment: This paper is accepted to ICCV 2025
☆ Into the Unknown: Towards using Generative Models for Sampling Priors of Environment Uncertainty for Planning in Configuration Spaces
Priors are vital for planning under partial observability, yet difficult to obtain in practice. We present a sampling-based pipeline that leverages large-scale pretrained generative models to produce probabilistic priors capturing environmental uncertainty and spatio-semantic relationships in a zero-shot manner. Conditioned on partial observations, the pipeline recovers complete RGB-D point cloud samples with occupancy and target semantics, formulated to be directly useful in configuration-space planning. We establish a Matterport3D benchmark of rooms partially visible through doorways, where a robot must navigate to an unobserved target object. Effective priors for this setting must represent both occupancy and target-location uncertainty in unobserved regions. Experiments show that our approach recovers commonsense spatial semantics consistent with ground truth, yielding diverse, clean 3D point clouds usable in motion planning, highlight the promise of generative models as a rich source of priors for robotic planning.
comment: Under Review
☆ COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models
Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple research works have attempted to improve compositionality performance by creative tricks such as improving prompt structure, chain of thought reasoning, etc. A more recent line of work attempts to impart additional reasoning in VLMs using well-trained Large Language Models (LLMs), which are far superior in linguistic understanding than VLMs to compensate for the limited linguistic prowess of VLMs. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present 'COCO-Tree' - a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve VLM's linguistic reasoning. COCO-Tree's beam search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks, Winoground, EqBench, ColorSwap, and SugarCrepe, in seven different open-source VLMs with varying sizes, demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.
comment: EMNLP 2025 (main)
☆ Frequency Domain Unlocks New Perspectives for Abdominal Medical Image Segmentation
Accurate segmentation of tumors and adjacent normal tissues in medical images is essential for surgical planning and tumor staging. Although foundation models generally perform well in segmentation tasks, they often struggle to focus on foreground areas in complex, low-contrast backgrounds, where some malignant tumors closely resemble normal organs, complicating contextual differentiation. To address these challenges, we propose the Foreground-Aware Spectrum Segmentation (FASS) framework. First, we introduce a foreground-aware module to amplify the distinction between background and the entire volume space, allowing the model to concentrate more effectively on target areas. Next, a feature-level frequency enhancement module, based on wavelet transform, extracts discriminative high-frequency features to enhance boundary recognition and detail perception. Eventually, we introduce an edge constraint module to preserve geometric continuity in segmentation boundaries. Extensive experiments on multiple medical datasets demonstrate superior performance across all metrics, validating the effectiveness of our framework, particularly in robustness under complex conditions and fine structure recognition. Our framework significantly enhances segmentation of low-contrast images, paving the way for applications in more diverse and complex medical imaging scenarios.
☆ ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
comment: Project Page: https://nenhang.github.io/ContextGen/
☆ Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency
3D Gaussian inpainting, a critical technique for numerous applications in virtual reality and multimedia, has made significant progress with pretrained diffusion models. However, ensuring multi-view consistency, an essential requirement for high-quality inpainting, remains a key challenge. In this work, we present PAInpainter, a novel approach designed to advance 3D Gaussian inpainting by leveraging perspective-aware content propagation and consistency verification across multi-view inpainted images. Our method iteratively refines inpainting and optimizes the 3D Gaussian representation with multiple views adaptively sampled from a perspective graph. By propagating inpainted images as prior information and verifying consistency across neighboring views, PAInpainter substantially enhances global consistency and texture fidelity in restored 3D scenes. Extensive experiments demonstrate the superiority of PAInpainter over existing methods. Our approach achieves superior 3D inpainting quality, with PSNR scores of 26.03 dB and 29.51 dB on the SPIn-NeRF and NeRFiller datasets, respectively, highlighting its effectiveness and generalization capability.
☆ A Survey on Agentic Multimodal Large Language Models
With the recent emergence of revolutionary autonomous agentic systems, research community is witnessing a significant shift from traditional static, passive, and domain-specific AI agents toward more dynamic, proactive, and generalizable agentic AI. Motivated by the growing interest in agentic AI and its potential trajectory toward AGI, we present a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs). In this survey, we explore the emerging paradigm of agentic MLLMs, delineating their conceptual foundations and distinguishing characteristics from conventional MLLM-based agents. We establish a conceptual framework that organizes agentic MLLMs along three fundamental dimensions: (i) Agentic internal intelligence functions as the system's commander, enabling accurate long-horizon planning through reasoning, reflection, and memory; (ii) Agentic external tool invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (iii) Agentic environment interaction further situates models within virtual or physical environments, allowing them to take actions, adapt strategies, and sustain goal-directed behavior in dynamic real-world scenarios. To further accelerate research in this area for the community, we compile open-source training frameworks, training and evaluation datasets for developing agentic MLLMs. Finally, we review the downstream applications of agentic MLLMs and outline future research directions for this rapidly evolving field. To continuously track developments in this rapidly evolving field, we will also actively update a public repository at https://github.com/HJYao00/Awesome-Agentic-MLLMs.
☆ Mixup Helps Understanding Multimodal Video Better
Multimodal video understanding plays a crucial role in tasks such as action recognition and emotion classification by combining information from different modalities. However, multimodal models are prone to overfitting strong modalities, which can dominate learning and suppress the contributions of weaker ones. To address this challenge, we first propose Multimodal Mixup (MM), which applies the Mixup strategy at the aggregated multimodal feature level to mitigate overfitting by generating virtual feature-label pairs. While MM effectively improves generalization, it treats all modalities uniformly and does not account for modality imbalance during training. Building on MM, we further introduce Balanced Multimodal Mixup (B-MM), which dynamically adjusts the mixing ratios for each modality based on their relative contributions to the learning objective. Extensive experiments on several datasets demonstrate the effectiveness of our methods in improving generalization and multimodal robustness.
☆ On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation
Self-supervised learning (SSL) has achieved remarkable success by learning meaningful representations without labeled data. However, a unified theoretical framework for understanding and comparing the efficiency of different SSL paradigms remains elusive. In this paper, we introduce a novel information-geometric framework to quantify representation efficiency. We define representation efficiency $\eta$ as the ratio between the effective intrinsic dimension of the learned representation space and its ambient dimension, where the effective dimension is derived from the spectral properties of the Fisher Information Matrix (FIM) on the statistical manifold induced by the encoder. Within this framework, we present a theoretical analysis of the Barlow Twins method. Under specific but natural assumptions, we prove that Barlow Twins achieves optimal representation efficiency ($\eta = 1$) by driving the cross-correlation matrix of representations towards the identity matrix, which in turn induces an isotropic FIM. This work provides a rigorous theoretical foundation for understanding the effectiveness of Barlow Twins and offers a new geometric perspective for analyzing SSL algorithms.
comment: 7 pages
☆ Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning
The capabilities of Large Vision-Language Models (LVLMs) have reached state-of-the-art on many visual reasoning tasks, including chart reasoning, yet they still falter on out-of-distribution (OOD) data, and degrade further when asked to produce their chain-of-thought (CoT) rationales, limiting explainability. We present Chart-RVR, a general framework that fine-tunes LVLMs to be more robust and explainable for chart reasoning by coupling Group Relative Policy Optimization (GRPO) with automatically verifiable rewards. Our framework comprises of three rewards that maximize: (i) correct chart-type classification, (ii) faithful chart table reconstruction, and (iii) process conformity. Applied to 3-billion-parameter LVLMs, Chart-RVR consistently outperforms standard supervised fine-tuning (SFT) on both in-distribution and out-of-distribution datasets, closing the OOD performance gap while improving rationale fidelity. The resulting models, the Chart-RVR-3B series, achieve state-of-the-art results on six chart-reasoning benchmarks spanning in-domain and OOD settings, surpassing all existing models of comparable size. Beyond accuracy, Chart-RVR yields more interpretable CoT rationales, strengthening trust and reliability - showcasing the power of verifiable rewards with GRPO for training reliable, interpretable chart-reasoning models.
comment: 23 pages
☆ IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
Existing vision language models (VLMs), including GPT-4 and DALL-E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.
☆ Comparative Evaluation of Neural Network Architectures for Generalizable Human Spatial Preference Prediction in Unseen Built Environments
The capacity to predict human spatial preferences within built environments is instrumental for developing Cyber-Physical-Social Infrastructure Systems (CPSIS). A significant challenge in this domain is the generalizability of preference models, particularly their efficacy in predicting preferences within environmental configurations not encountered during training. While deep learning models have shown promise in learning complex spatial and contextual dependencies, it remains unclear which neural network architectures are most effective at generalizing to unseen layouts. To address this, we conduct a comparative study of Graph Neural Networks, Convolutional Neural Networks, and standard feedforward Neural Networks using synthetic data generated from a simplified and synthetic pocket park environment. Beginning with this illustrative case study, allows for controlled analysis of each model's ability to transfer learned preference patterns to unseen spatial scenarios. The models are evaluated based on their capacity to predict preferences influenced by heterogeneous physical, environmental, and social features. Generalizability score is calculated using the area under the precision-recall curve for the seen and unseen layouts. This generalizability score is appropriate for imbalanced data, providing insights into the suitability of each neural network architecture for preference-aware human behavior modeling in unseen built environments.
comment: The 15th International Workshop on Structural Health Monitoring (IWSHM)
☆ Towards Distribution-Shift Uncertainty Estimation for Inverse Problems with Generative Priors
Generative models have shown strong potential as data-driven priors for solving inverse problems such as reconstructing medical images from undersampled measurements. While these priors improve reconstruction quality with fewer measurements, they risk hallucinating features when test images lie outside the training distribution. Existing uncertainty quantification methods in this setting (i) require an in-distribution calibration dataset, which may not be available, (ii) provide heuristic rather than statistical estimates, or (iii) quantify uncertainty from model capacity or limited measurements rather than distribution shift. We propose an instance-level, calibration-free uncertainty indicator that is sensitive to distribution shift, requires no knowledge of the training distribution, and incurs no retraining cost. Our key hypothesis is that reconstructions of in-distribution images remain stable under random measurement variations, while reconstructions of out-of-distribution (OOD) images exhibit greater instability. We use this stability as a proxy for detecting distribution shift. Our proposed OOD indicator is efficiently computable for any computational imaging inverse problem; we demonstrate it on tomographic reconstruction of MNIST digits, where a learned proximal network trained only on digit "0" is evaluated on all ten digits. Reconstructions of OOD digits show higher variability and correspondingly higher reconstruction error, validating this indicator. These results suggest a deployment strategy that pairs generative priors with lightweight guardrails, enabling aggressive measurement reduction for in-distribution cases while automatically warning when priors are applied out of distribution.
comment: Code is available at https://github.com/voilalab/uncertainty_quantification_LPN
☆ DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects ICRA 2026
6D pose estimation of textureless objects is valuable for industrial robotic applications, yet remains challenging due to the frequent loss of depth information. Current multi-view methods either rely on depth data or insufficiently exploit multi-view geometric cues, limiting their performance. In this paper, we propose DKPMV, a pipeline that achieves dense keypoint-level fusion using only multi-view RGB images as input. We design a three-stage progressive pose optimization strategy that leverages dense multi-view keypoint geometry information. To enable effective dense keypoint fusion, we enhance the keypoint network with attentional aggregation and symmetry-aware training, improving prediction accuracy and resolving ambiguities on symmetric objects. Extensive experiments on the ROBI dataset demonstrate that DKPMV outperforms state-of-the-art multi-view RGB approaches and even surpasses the RGB-D methods in the majority of cases. The code will be available soon.
comment: 12 pages, 9 figures, submitted to ICRA 2026
☆ FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
☆ DreamMakeup: Face Makeup Customization using Latent Diffusion Models
The exponential growth of the global makeup market has paralleled advancements in virtual makeup simulation technology. Despite the progress led by GANs, their application still encounters significant challenges, including training instability and limited customization capabilities. Addressing these challenges, we introduce DreamMakup - a novel training-free Diffusion model based Makeup Customization method, leveraging the inherent advantages of diffusion models for superior controllability and precise real-image editing. DreamMakeup employs early-stopped DDIM inversion to preserve the facial structure and identity while enabling extensive customization through various conditioning inputs such as reference images, specific RGB colors, and textual descriptions. Our model demonstrates notable improvements over existing GAN-based and recent diffusion-based frameworks - improved customization, color-matching capabilities, identity preservation and compatibility with textual descriptions or LLMs with affordable computational costs.
☆ SceneTextStylizer: A Training-Free Scene Text Style Transfer Framework with Diffusion Model
With the rapid development of diffusion models, style transfer has made remarkable progress. However, flexible and localized style editing for scene text remains an unsolved challenge. Although existing scene text editing methods have achieved text region editing, they are typically limited to content replacement and simple styles, which lack the ability of free-style transfer. In this paper, we introduce SceneTextStylizer, a novel training-free diffusion-based framework for flexible and high-fidelity style transfer of text in scene images. Unlike prior approaches that either perform global style transfer or focus solely on textual content modification, our method enables prompt-guided style transformation specifically for text regions, while preserving both text readability and stylistic consistency. To achieve this, we design a feature injection module that leverages diffusion model inversion and self-attention to transfer style features effectively. Additionally, a region control mechanism is introduced by applying a distance-based changing mask at each denoising step, enabling precise spatial control. To further enhance visual quality, we incorporate a style enhancement module based on the Fourier transform to reinforce stylistic richness. Extensive experiments demonstrate that our method achieves superior performance in scene text style transformation, outperforming existing state-of-the-art methods in both visual fidelity and text preservation.
☆ Topological Alignment of Shared Vision-Language Embedding Space
Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagram with theoretical error bounds using graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on the CIFAR-100, and stronger multilingual retrieval performance on the xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
comment: 24 pages, 5 figures, 19 tables
☆ Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (i.e., 500 multiple-choice question-answering, with country-level answer and panoramas) and WhereStreet (i.e., 310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt the final-prediction metrics: location accuracies within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue's marginal contribution. We benchmark 13 state-of-the-art VLMs with web searching tools on our EarthWhere and report different types of final answer accuracies as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than others. These findings highlight not only the promise but also the persistent challenges of models to mitigate bias and achieve robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.
☆ rareboost3d: a synthetic lidar dataset with enhanced rare classes
Real-world point cloud datasets have made significant contributions to the development of LiDAR-based perception technologies, such as object segmentation for autonomous driving. However, due to the limited number of instances in some rare classes, the long-tail problem remains a major challenge in existing datasets. To address this issue, we introduce a novel, synthetic point cloud dataset named RareBoost3D, which complements existing real-world datasets by providing significantly more instances for object classes that are rare in real-world datasets. To effectively leverage both synthetic and real-world data, we further propose a cross-domain semantic alignment method named CSC loss that aligns feature representations of the same class across different domains. Experimental results demonstrate that this alignment significantly enhances the performance of LiDAR point cloud segmentation models over real-world data.
FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
comment: Project page: https://soroushmehraban.github.io/FastHMR/
☆ Evaluating the Explainability of Vision Transformers in Medical Imaging MICCAI 2025
Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
comment: Accepted at Workshop on Interpretability of Machine Intelligence in Medical Image Computing at MICCAI 2025
☆ Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning ICCV
Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle with generalization in such settings, as they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in the Track 3 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories namely: Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.
comment: The paper was accepted at ICCV Conference 2025
☆ PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation
Accurately estimating the 3D layout of rooms is a crucial task in computer vision, with potential applications in robotics, augmented reality, and interior design. This paper proposes a novel model, PanoTPS-Net, to estimate room layout from a single panorama image. Leveraging a Convolutional Neural Network (CNN) and incorporating a Thin Plate Spline (TPS) spatial transformation, the architecture of PanoTPS-Net is divided into two stages: First, a convolutional neural network extracts the high-level features from the input images, allowing the network to learn the spatial parameters of the TPS transformation. Second, the TPS spatial transformation layer is generated to warp a reference layout to the required layout based on the predicted parameters. This unique combination empowers the model to properly predict room layouts while also generalizing effectively to both cuboid and non-cuboid layouts. Extensive experiments on publicly available datasets and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed method. The results underscore the model's accuracy in room layout estimation and emphasize the compatibility between the TPS transformation and panorama images. The robustness of the model in handling both cuboid and non-cuboid room layout estimation is evident with a 3DIoU value of 85.49, 86.16, 81.76, and 91.98 on PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD datasets, respectively. The source code is available at: https://github.com/HatemHosam/PanoTPS_Net.
☆ MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics ICCV 2025
Diffusion models are renowned for their generative capabilities, yet their pretraining processes exhibit distinct phases of learning speed that have been entirely overlooked in prior post-training acceleration efforts in the community. In this study, we introduce a novel framework called MosaicDiff that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model's inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant speed-ups in sampling without compromising output quality, outperforming previous state-of-the-art methods by large margins, also providing a new viewpoint for more efficient and robust training-free diffusion acceleration.
comment: International Conference on Computer Vision, ICCV 2025
☆ Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis ICCV 2025
Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80\%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6\% in VQA accuracy while maintaining captioning quality.
comment: This paper was accepted at ICCV 2025
☆ MammoDINO: Anatomically Aware Self-Supervision for Mammographic Images
Self-supervised learning (SSL) has transformed vision encoder training in general domains but remains underutilized in medical imaging due to limited data and domain specific biases. We present MammoDINO, a novel SSL framework for mammography, pretrained on 1.4 million mammographic images. To capture clinically meaningful features, we introduce a breast tissue aware data augmentation sampler for both image-level and patch-level supervision and a cross-slice contrastive learning objective that leverages 3D digital breast tomosynthesis (DBT) structure into 2D pretraining. MammoDINO achieves state-of-the-art performance on multiple breast cancer screening tasks and generalizes well across five benchmark datasets. It offers a scalable, annotation-free foundation for multipurpose computer-aided diagnosis (CAD) tools for mammogram, helping reduce radiologists' workload and improve diagnostic efficiency in breast cancer screening.
comment: 5 pages
☆ GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality
As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce \our{} (\textbf{G}aussian \textbf{S}platting for \textbf{V}irtual \textbf{E}nvironment \textbf{R}endering and \textbf{S}cene \textbf{E}diting), a novel method designed to overcome these challenges by directly integrating an object's mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, \our{} facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.
☆ Data or Language Supervision: What Makes CLIP Better than DINO?
CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
comment: EMNLP 2025 Findings
☆ Enhancing the Quality of 3D Lunar Maps Using JAXA's Kaguya Imagery
As global efforts to explore the Moon intensify, the need for high-quality 3D lunar maps becomes increasingly critical-particularly for long-distance missions such as NASA's Endurance mission concept, in which a rover aims to traverse 2,000 km across the South Pole-Aitken basin. Kaguya TC (Terrain Camera) images, though globally available at 10 m/pixel, suffer from altitude inaccuracies caused by stereo matching errors and JPEG-based compression artifacts. This paper presents a method to improve the quality of 3D maps generated from Kaguya TC images, focusing on mitigating the effects of compression-induced noise in disparity maps. We analyze the compression behavior of Kaguya TC imagery, and identify systematic disparity noise patterns, especially in darker regions. In this paper, we propose an approach to enhance 3D map quality by reducing residual noise in disparity images derived from compressed images. Our experimental results show that the proposed approach effectively reduces elevation noise, enhancing the safety and reliability of terrain data for future lunar missions.
comment: Presented at IEEE SMC 2025
♻ ☆ Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression NeurIPS 2025
With the rise of the fine-tuned-pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 50x compression, (b) general NLP models (RoBERTa-base, T5-base) with up to 224x compression, (c) vision models (ViT-B/32, ViT-L/14) with up to 132x compression, and (d) multi-modal models (BEiT-3) with 18x compression, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression. Code is available at https://github.com/xiaohuiwang000/UltraDelta.
comment: Accepted at NeurIPS 2025
♻ ☆ StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
♻ ☆ InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition NeurIPS 2025
Language-Guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
comment: Accepted to NeurIPS 2025
♻ ☆ SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction NeurIPS 2025
Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and have difficulty in handling difficult human poses and reconstructing fine details. In this paper, we propose SyncHuman, a novel framework that combines 2D multiview generative model and 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. Multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.
comment: NeurIPS 2025 https://xishuxishu.github.io/SyncHuman.github.io/
♻ ☆ Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization
Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision--language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.
♻ ☆ Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
comment: The 1st version
♻ ☆ Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence. We first propose a holistic taxonomy of spatial tasks that unifies existing benchmarks and a standardized protocol for the fair evaluation of state-of-the-art proprietary and open-source models across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail even the most advanced multimodal models.
♻ ☆ ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
We present ViDRiP-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, including single patch images, automatically segmented pathology video clips, and manually segmented pathology videos. This integration closely mirrors the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the ViDRiP-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. ViDRiP-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at: https://github.com/QuIIL/ViDRiP-LLaVA.
♻ ☆ Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
♻ ☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
comment: 30 pages, 27 figures
♻ ☆ Contrastive Representation Distillation via Multi-Scale Feature Decoupling
Knowledge distillation enhances the performance of compact student networks by transferring knowledge from more powerful teacher networks without introducing additional parameters. In the feature space, local regions within an individual global feature encode distinct yet interdependent semantic information. Previous feature-based distillation methods mainly emphasize global feature alignment while neglecting the decoupling of local regions within an individual global feature, which often results in semantic confusion and suboptimal performance. Moreover, conventional contrastive representation distillation suffers from low efficiency due to its reliance on a large memory buffer to store feature samples. To address these limitations, this work proposes MSDCRD, a model-agnostic distillation framework that systematically decouples global features into multi-scale local features and leverages the resulting semantically rich feature samples with tailored sample-wise and feature-wise contrastive losses. This design enables efficient distillation using only a single batch, eliminating the dependence on external memory. Extensive experiments demonstrate that MSDCRD achieves superior performance not only in homogeneous teacher-student settings but also in heterogeneous architectures where feature discrepancies are more pronounced, highlighting its strong generalization capability.
♻ ☆ Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval ACM MM 2025
As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
comment: This paper was originally submitted to ACM MM 2025 on April 12, 2025
♻ ☆ Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion
Diffusion models have transformed image synthesis by establishing unprecedented quality and creativity benchmarks. Nevertheless, their large-scale deployment faces challenges due to computationally intensive iterative denoising processes. Although post-training quantization (PTQ) provides an effective pathway for accelerating sampling, the iterative nature of diffusion models causes stepwise quantization errors to accumulate progressively during generation, inevitably compromising output fidelity. To address this challenge, we develop a theoretical framework that mathematically formulates error propagation in Diffusion Models (DMs), deriving per-step quantization error propagation equations and establishing the first closed-form solution for cumulative error. Building on this theoretical foundation, we propose a timestep-aware cumulative error compensation scheme. Extensive experiments on multiple image datasets demonstrate that our compensation strategy effectively mitigates error propagation, significantly enhancing existing PTQ methods. Specifically, it achieves a 1.2 PSNR improvement over SVDQuant on SDXL W4A4, while incurring only an additional $<$ 0.5\% time overhead.
♻ ☆ Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.
♻ ☆ DDFusion:Degradation-Decoupled Fusion Framework for Robust Infrared and Visible Images Fusion
Conventional infrared and visible image fusion(IVIF) methods often assume high-quality inputs, neglecting real-world degradations such as low-light and noise, which limits their practical applicability. To address this, we propose a Degradation-Decoupled Fusion(DDFusion) framework, which achieves degradation decoupling and jointly models degradation suppression and image fusion in a unified manner. Specifically, the Degradation-Decoupled Optimization Network(DDON) performs degradation-specific decomposition to decouple inter-degradation and degradation-information components, followed by component-specific extraction paths for effective suppression of degradation and enhancement of informative features. The Interactive Local-Global Fusion Network (ILGFN) aggregates complementary features across multi-scale pathways and alleviates performance degradation caused by the decoupling between degradation optimization and image fusion. Extensive experiments demonstrate that DDFusion achieves superior fusion performance under both clean and degraded conditions. Our code is available at https://github.com/Lmmh058/DDFusion.
♻ ☆ NeMo: Needle in a Montage for Video-Language Understanding
Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.
♻ ☆ Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning BMVC
Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data. Our code is available at https://github.com/shermanlian/IC-ViT.
comment: Paper has been accepted by BMVC as an Oral presentation
♻ ☆ ID-Booth: Identity-consistent Face Generation with Diffusion Models
Recent advances in generative modeling have enabled the generation of high-quality synthetic data that is applicable in a variety of domains, including face recognition. Here, state-of-the-art generative models typically rely on conditioning and fine-tuning of powerful pretrained diffusion models to facilitate the synthesis of realistic images of a desired identity. Yet, these models often do not consider the identity of subjects during training, leading to poor consistency between generated and intended identities. In contrast, methods that employ identity-based training objectives tend to overfit on various aspects of the identity, and in turn, lower the diversity of images that can be generated. To address these issues, we present in this paper a novel generative diffusion-based framework, called ID-Booth. ID-Booth consists of a denoising network responsible for data generation, a variational auto-encoder for mapping images to and from a lower-dimensional latent space and a text encoder that allows for prompt-based control over the generation procedure. The framework utilizes a novel triplet identity training objective and enables identity-consistent image generation while retaining the synthesis capabilities of pretrained diffusion models. Experiments with a state-of-the-art latent diffusion model and diverse prompts reveal that our method facilitates better intra-identity consistency and inter-identity separability than competing methods, while achieving higher image diversity. In turn, the produced data allows for effective augmentation of small-scale datasets and training of better-performing recognition models in a privacy-preserving manner. The source code for the ID-Booth framework is publicly available at https://github.com/dariant/ID-Booth.
comment: IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2025, 14 pages
♻ ☆ Editable-DeepSC: Reliable Cross-Modal Semantic Communications for Facial Editing
Real-time computer vision (CV) plays a crucial role in various real-world applications, whose performance is highly dependent on communication networks. Nonetheless, the data-oriented characteristics of conventional communications often do not align with the special needs of real-time CV tasks. To alleviate this issue, the recently emerged semantic communications only transmit task-related semantic information and exhibit a promising landscape to address this problem. However, the communication challenges associated with Semantic Facial Editing, one of the most important real-time CV applications on social media, still remain largely unexplored. In this paper, we fill this gap by proposing Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing. Firstly, we theoretically discuss different transmission schemes that separately handle communications and editings, and emphasize the necessity of Joint Editing-Channel Coding (JECC) via iterative attributes matching, which integrates editings into the communication chain to preserve more semantic mutual information. To compactly represent the high-dimensional data, we leverage inversion methods via pre-trained StyleGAN priors for semantic coding. To tackle the dynamic channel noise conditions, we propose SNR-aware channel coding via model fine-tuning. Extensive experiments indicate that Editable-DeepSC can achieve superior editings while significantly saving the transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.
♻ ☆ The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models NeurIPS 2025
Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA researches generally suffer from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and make it difficult to assess their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP--a model trained with a Sigmoid loss--and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.
comment: NeurIPS 2025 Datasets and Benchmarks Track. Github link: https://github.com/TomSheng21/tta-vlm
♻ ☆ Surface-Aware Distilled 3D Semantic Features
Many 3D tasks such as pose alignment, animation, motion transfer, and 3D reconstruction rely on establishing correspondences between 3D shapes. This challenge has recently been approached by pairwise matching of semantic features from pre-trained vision models. However, despite their power, these features struggle to differentiate instances of the same semantic class such as ``left hand'' versus ``right hand'' which leads to substantial mapping errors. To solve this, we learn a surface-aware embedding space that is robust to these ambiguities while facilitating shared mapping for an entire family of 3D shapes. Importantly, our approach is self-supervised and requires only a small number of unpaired training meshes to infer features for new possibly imperfect 3D shapes at test time. We achieve this by introducing a contrastive loss that preserves the semantic content of the features distilled from foundational models while disambiguating features located far apart on the shape's surface. We observe superior performance in correspondence matching benchmarks and enable downstream applications including 2D-to-3D and 3D-to-3D texture transfer, in-part segmentation, pose alignment, and motion transfer in low-data regimes. Unlike previous pairwise approaches, our solution constructs a joint embedding space, where both seen and unseen 3D shapes are implicitly aligned without further optimization. The code is available at https://graphics.tudelft.nl/SurfaceAware3DFeatures.
♻ ☆ Towards Robust and Realible Multimodal Fake News Detection with Incomplete Modality
Multimodal fake news detection (MFND) has become an urgent task with the emergence of huge multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel generic and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning to compensate for missing modalities by dynamically leveraging complementary information through multiple experts. (2) Incomplete Modality Adapters compensates for the missing information by leveraging the new feature distribution. (3) Modality Missing Learning leveraging an label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of fake news detection in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.
♻ ☆ Real-Time Position-Aware View Synthesis from Single-View Input
Recent advancements in view synthesis have significantly enhanced immersive experiences across various computer graphics and multimedia applications, including telepresence and entertainment. By enabling the generation of new perspectives from a single input view, view synthesis allows users to better perceive and interact with their environment. However, many state-of-the-art methods, while achieving high visual quality, face limitations in real-time performance, which makes them less suitable for live applications where low latency is critical. In this paper, we present a lightweight, position-aware network designed for real-time view synthesis from a single input image and a target camera pose. The proposed framework consists of a Position Aware Embedding, which efficiently maps positional information from the target pose to generate high dimensional feature maps. These feature maps, along with the input image, are fed into a Rendering Network that merges features from dual encoder branches to resolve both high and low level details, producing a realistic new view of the scene. Experimental results demonstrate that our method achieves superior efficiency and visual quality compared to existing approaches, particularly in handling complex translational movements without explicit geometric operations like warping. This work marks a step toward enabling real-time live and interactive telepresence applications.
♻ ☆ SWIFT: Semantic Watermarking for Image Forgery Thwarting
This paper proposes a novel approach towards image authentication and tampering detection by using watermarking as a communication channel for semantic information. We modify the HiDDeN deep-learning watermarking architecture to embed and extract high-dimensional real vectors representing image captions. Our method improves significantly robustness on both malign and benign edits. We also introduce a local confidence metric correlated with Message Recovery Rate, enhancing the method's practical applicability. This approach bridges the gap between traditional watermarking and passive forensic methods, offering a robust solution for image integrity verification.
comment: Accepted at IEEE WIFS 2024; Code released at : https://github.com/gautierevn/swift_watermarking
♻ ☆ OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation
Open-Vocabulary Segmentation (OVS) aims to segment classes that are not present in the training dataset. However, most existing studies assume that the training data is fixed in advance, overlooking more practical scenarios where new datasets are continuously collected over time. To address this, we first analyze how existing OVS models perform under such conditions. In this context, we explore several approaches such as retraining, fine-tuning, and continual learning but find that each of them has clear limitations. To address these issues, we propose ConOVS, a novel continual learning method based on a Mixture-of-Experts framework. ConOVS dynamically combines expert decoders based on the probability that an input sample belongs to the distribution of each incremental dataset. Through extensive experiments, we show that ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets, effectively expanding the recognition capabilities of OVS models when data is collected sequentially.
♻ ☆ TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models AAAI 2026
Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4\% vs 68.4\% baseline), cross-environment validation on SimplerEnv (4.8\% relative improvement), and 8.7\% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
comment: Manuscript submitted to AAAI 2026, currently under review
♻ ☆ MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations -- text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
comment: EMNLP 2025
♻ ☆ One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting NeurIPS 2025
Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
comment: 27 pages, 11 figures, to appear at NeurIPS 2025
♻ ☆ Learning Shared Representations from Unpaired Data NeurIPS 2025
Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our project page: https://shaham-lab.github.io/SUE_page.
comment: Accepted to NeurIPS 2025
♻ ☆ When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection
Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification.. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to be trained data-dependent, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8\%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.
♻ ☆ Learned Display Radiance Fields with Lensless Cameras
Calibrating displays is a basic and regular task that content creators must perform to maintain optimal visual experience, yet it remains a troublesome issue. Measuring display characteristics from different viewpoints often requires specialized equipment and a dark room, making it inaccessible to most users. To avoid specialized hardware requirements in display calibrations, our work co-designs a lensless camera and an Implicit Neural Representation based algorithm for capturing display characteristics from various viewpoints. More specifically, our pipeline enables efficient reconstruction of light fields emitted from a display from a viewing cone of 46.6{\deg} X 37.6{\deg}. Our emerging pipeline paves the initial steps towards effortless display calibration and characterization.
♻ ☆ Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
comment: The code and model are available at https://github.com/M-E-AGI-Lab/Muddit
♻ ☆ SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
comment: 21 pages, 15 figures, 7 tables
♻ ☆ Open Vocabulary Multi-Label Video Classification ECCV 2024
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.
comment: Accepted at ECCV 2024
♻ ☆ LiDAR-GS:Real-time LiDAR Re-Simulation using Gaussian Splatting
We present LiDAR-GS, a Gaussian Splatting (GS) method for real-time, high-fidelity re-simulation of LiDAR scans in public urban road scenes. Recent GS methods proposed for cameras have achieved significant advancements in real-time rendering beyond Neural Radiance Fields (NeRF). However, applying GS representation to LiDAR, an active 3D sensor type, poses several challenges that must be addressed to preserve high accuracy and unique characteristics. Specifically, LiDAR-GS designs a differentiable laser beam splatting, using range-view representation for precise surface splatting by projecting lasers onto micro cross-sections, effectively eliminating artifacts associated with local affine approximations. Furthermore, LiDAR-GS leverages Neural Gaussian Representation, which further integrate view-dependent clues, to represent key LiDAR properties that are influenced by the incident direction and external factors. Combining these practices with some essential adaptations, e.g., dynamic instances decomposition, LiDAR-GS succeeds in simultaneously re-simulating depth, intensity, and ray-drop channels, achieving state-of-the-art results in both rendering frame rate and quality on publically available large scene datasets when compared with the methods using explicit mesh or implicit NeRF. Our source code is publicly available at https://www.github.com/cqf7419/LiDAR-GS.
♻ ☆ Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.
comment: 15 pages, 3 figures, code are available at https://github.com/LEI-QI-233/HAR-in-Space
♻ ☆ VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
Aligning Vision-Language Models (VLMs) with safety standards is essential to mitigate risks arising from their multimodal complexity, where integrating vision and language unveils subtle threats beyond the reach of conventional safeguards. Inspired by the insight that reasoning across modalities is key to preempting intricate vulnerabilities, we propose a novel direction for VLM safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter, dynamically interpreting text-image interactions to deliver refined prompts that bolster safety across diverse VLM architectures without altering their core parameters. To achieve this, we devise a three-stage reasoning pipeline to synthesize a dataset that trains the rewriter to infer subtle threats, enabling tailored, actionable responses over generic refusals. Extensive experiments across three benchmarks with five VLMs reveal that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1 achieves a remarkable 43.59\% increase in average safety across five models on the SIUO benchmark.
♻ ☆ PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease
Parkinson's disease (PD) is a common neurodegenerative disorder that severely diminishes patients' quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists' expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose an end-to-end automated diagnostic method for PD, termed PD-Diag-Net, which performs risk assessment and auxiliary diagnosis directly from raw MRI scans. This framework first introduces an MRI Pre-processing Module (MRI-Processor) to mitigate inter-subject and inter-scanner variability by flexibly integrating established medical imaging preprocessing tools. It then incorporates two forms of clinical prior knowledge: (1) Brain-Region-Relevance-Prior (Relevance-Prior), which specifies brain regions strongly associated with PD; and (2) Brain-Region-Aging-Prior (Aging-Prior), which reflects the accelerated aging typically observed in PD-associated regions. Building on these priors, we design two dedicated modules: the Relevance-Prior Guided Feature Aggregation Module (Aggregator), which guides the model to focus on PD-associated regions at the inter-subject level, and the Age-Prior Guided Diagnosis Module (Diagnoser), which leverages brain age gaps as auxiliary constraints at the intra-subject level to enhance diagnostic accuracy and clinical interpretability. Furthermore, we collected external test data from our collaborating hospital. Experimental results show that PD-Diag-Net achieves 86\% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%.
♻ ☆ RoHOI: Robustness Benchmark for Human-Object Interaction Detection
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate predictions. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code are available at https://github.com/KratosWen/RoHOI.
comment: Benchmarks, datasets, and code are available at https://github.com/KratosWen/RoHOI
♻ ☆ Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
♻ ☆ Context Guided Transformer Entropy Modeling for Video Compression ICCV 2025
Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models lack explicit modeling the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. Meanwhile, a teacher-student network is designed as dependency-weighted spatial context assigner to explicitly model the dependency of spatial context order. The teacher generates an attention map to represent token importance and an entropy map to reflect prediction certainty from randomly masked inputs, guiding the student to select the weighted top-k tokens with the highest spatial dependency. During inference, only the student is used to predict undecoded tokens based on high-dependency context. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art conditional entropy model.
comment: ICCV 2025. This is an update to the camera-ready version
♻ ☆ BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning ICCV 2025
Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned--they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or general-purpose data of the SAYCam size. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.
comment: Accepted to ICCV 2025
♻ ☆ GDO:Gradual Domain Osmosis
In this paper, we propose a new method called Gradual Domain Osmosis, which aims to solve the problem of smooth knowledge migration from source domain to target domain in Gradual Domain Adaptation (GDA). Traditional Gradual Domain Adaptation methods mitigate domain bias by introducing intermediate domains and self-training strategies, but often face the challenges of inefficient knowledge migration or missing data in intermediate domains. In this paper, we design an optimisation framework based on the hyperparameter $\lambda$ by dynamically balancing the loss weights of the source and target domains, which enables the model to progressively adjust the strength of knowledge migration ($\lambda$ incrementing from 0 to 1) during the training process, thus achieving cross-domain generalisation more efficiently. Specifically, the method incorporates self-training to generate pseudo-labels and iteratively updates the model by minimising a weighted loss function to ensure stability and robustness during progressive adaptation in the intermediate domain. The experimental part validates the effectiveness of the method on rotated MNIST, colour-shifted MNIST, portrait dataset and forest cover type dataset, and the results show that it outperforms existing baseline methods. The paper further analyses the impact of the dynamic tuning strategy of the hyperparameter $\lambda$ on the performance through ablation experiments, confirming the advantages of progressive domain penetration in mitigating the domain bias and enhancing the model generalisation capability. The study provides a theoretical support and practical framework for asymptotic domain adaptation and expands its application potential in dynamic environments.
comment: restoring v1 because of an incorrect replacement
♻ ☆ STAR: A Benchmark for Astronomical Star Fields Super-Resolution
Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) to evaluate SR models in physical view. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that could accurately infer the flux-consistent high-resolution images from input photometry, suppressing several SR state-of-the-art methods by 24.84% on a novel designed flux consistency metric, showing the priority of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.
♻ ☆ Goal-Based Vision-Language Driving
Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion each contribute the most to these gains. Beyond safety, NovaDrive's shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
comment: 6 pages
♻ ☆ Learning Diffusion Models with Flexible Representation Guidance NeurIPS 2025
Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.
comment: NeurIPS 2025; Also Oral at ICML 2025 FM4LS workshop
♻ ☆ DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy ICCV 2025
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non- and multi-referents scenarios without requiring specialized architectural modifications, enhancing its general applicability. The codes and models are available at https://github.com/Dmmm1997/DeRIS.
comment: Accepted by ICCV 2025
♻ ☆ LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation
Diverse human motion generation is an increasingly important task, having various applications in computer vision, human-computer interaction and animation. While text-to-motion synthesis using diffusion models has shown success in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban movement analysis has been widely used by dance experts to express the details of motion including motion quality as consistent as possible. Inspired by that, this work aims for interpretable and expressive control of human motion generation by seamlessly integrating the quantification methods of Laban Effort and Shape components into the text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model to have desired Laban Effort and Shape components without any additional motion data by updating the text embedding of pretrained diffusion models during the sampling step. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.
♻ ☆ VideoAds for Fast-Paced Video Understanding ICCV2025
Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.
comment: ICCV2025
♻ ☆ Vision-Language Cross-Attention for Real-Time Autonomous Driving
Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m $\times$ 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15%. and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
comment: 5 pages
♻ ☆ LSP-ST: Ladder Shape-Biased Side-Tuning for Robust Infrared Small Target Detection
Fine-tuning the Segment Anything Model (SAM) for infrared small target detection poses significant challenges due to severe domain shifts. Existing adaptation methods often incorporate handcrafted priors to bridge this gap, yet such designs limit generalization and scalability. We identify a fundamental texture bias in foundation models, which overly depend on local texture cues for target localization. To address this, we propose Ladder Shape-Biased Side-Tuning (LSP-ST), a novel approach that introduces a shape-aware inductive bias to facilitate effective adaptation beyond texture cues. In contrast to prior work that injects explicit edge or contour features, LSP-ST models shape as a global structural prior, integrating both boundaries and internal layouts. We design a Shape-Enhanced Large-Kernel Attention Module to hierarchically and implicitly capture structural information in a fully differentiable manner, without task-specific handcrafted guidance. A theoretical analysis grounded in matched filtering and backpropagation reveals the mechanism by which the proposed attention improves structure-aware learning. With only 4.72M learnable parameters, LSP-ST achieves state-of-the-art performance on multiple infrared small target detection benchmarks. Furthermore, its strong generalization is validated across tasks such as mirror detection, shadow detection, and camouflaged object detection, while maintaining stable performance on texture-driven tasks like salient object detection, demonstrating that the introduced shape bias complements rather than competes with texture-based reasoning.
♻ ☆ Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement NeurIPS 2025
Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction.In Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency reconstruction.Stage Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent structure.At inference, a tunable parameter {\alpha} enables flexible fidelity-perception trade-offs.Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
comment: Accepted by NeurIPS 2025
♻ ☆ SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance
Low signal-to-noise ratio videos -- such as those from underwater sonar, ultrasound, and microscopy -- pose significant challenges for computer vision models, particularly when paired clean imagery is unavailable. We present Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a novel self-supervised method that denoises low-SNR sensor videos using only raw noisy data. By leveraging distinctions between foreground and background motion and exaggerating objects with stronger motion signal, SAVeD enhances foreground object visibility and reduces background and camera noise without requiring clean video. SAVeD has a set of architectural optimizations that lead to faster throughput, training, and inference than existing deep learning methods. We also introduce a new denoising metric, FBD, which indicates foreground-background divergence for detection datasets without requiring clean imagery. Our approach achieves state-of-the-art results for classification, detection, tracking, and counting tasks, and it does so with fewer training resource requirements than existing deep-learning-based denoising methods. Project page: https://suzanne-stathatos.github.io/SAVeD Code page: https://github.com/suzanne-stathatos/SAVeD
comment: Project page: https://suzanne-stathatos.github.io/SAVeD Code page: https://github.com/suzanne-stathatos/SAVeD
♻ ☆ TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity
AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to above issues. However, the widespread adoption and advancing capabilities of generative image editing tools have amplified malicious tampering risks, while simultaneously posing new challenges to passive tampering detection and watermark robustness. To address these challenges, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, the watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and the tamper-aware decoding (TAD) guided by localization results. The experimental results demonstrate that TAG-WM achieves state-of-the-art performance in both tampering robustness and localization capability even under distortion, while preserving lossless generation quality and maintaining a watermark capacity of 256 bits. The code is available at: https://github.com/Suchenl/TAG-WM.
♻ ☆ TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation
In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.
comment: Project page: https://talkcuts.github.io/
♻ ☆ FormCoach: Lift Smarter, Not Harder
Good form is the difference between strength and strain, yet for the fast-growing community of at-home fitness enthusiasts, expert feedback is often out of reach. FormCoach transforms a simple camera into an always-on, interactive AI training partner, capable of spotting subtle form errors and delivering tailored corrections in real time, leveraging vision-language models (VLMs). We showcase this capability through a web interface and benchmark state-of-the-art VLMs on a dataset of 1,700 expert-annotated user-reference video pairs spanning 22 strength and mobility exercises. To accelerate research in AI-driven coaching, we release both the dataset and an automated, rubric-based evaluation pipeline, enabling standardized comparison across models. Our benchmarks reveal substantial gaps compared to human-level coaching, underscoring both the challenges and opportunities in integrating nuanced, context-aware movement analysis into interactive AI systems. By framing form correction as a collaborative and creative process between humans and machines, FormCoach opens a new frontier in embodied AI.
♻ ☆ SMC++: Masked Learning of Unsupervised Video Semantic Compression ICCV 2023
Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during the compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bitcost and bringing semantic noises. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC as an advanced SMC++ model from several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module, to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs, on three video analysis tasks and seven datasets. \textit{Codes and model are available at: https://github.com/tianyuan168326/VideoSemanticCompression-Pytorch.
comment: Accepted to TPAMI; Substantial Extension of ICCV 2023 paper
♻ ☆ Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models ICCV 2025
Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
comment: Accepted to ICCV 2025 Workshop CVAMD
♻ ☆ LH2Face: Loss function for Hard High-quality Face
In current practical face authentication systems, most face recognition (FR) algorithms are based on cosine similarity with softmax classification. Despite its reliable classification performance, this method struggles with hard samples. A popular strategy to improve FR performance is incorporating angular or cosine margins. However, it does not take face quality or recognition hardness into account, simply increasing the margin value and thus causing an overly uniform training strategy. To address this problem, a novel loss function is proposed, named Loss function for Hard High-quality Face (LH2Face). Firstly, a similarity measure based on the von Mises-Fisher (vMF) distribution is stated, specifically focusing on the logarithm of the Probability Density Function (PDF), which represents the distance between a probability distribution and a vector. Then, an adaptive margin-based multi-classification method using softmax, called the Uncertainty-Aware Margin Function, is implemented in the article. Furthermore, proxy-based loss functions are used to apply extra constraints between the proxy and sample to optimize their representation space distribution. Finally, a renderer is constructed that optimizes FR through face reconstruction and vice versa. Our LH2Face is superior to similiar schemes on hard high-quality face datasets, achieving 49.39% accuracy on the IJB-B dataset, which surpasses the second-place method by 2.37%.
♻ ☆ One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
Image captioning is fundamental for applications like video-grounded chatbot systems and navigation robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal LLMs (MLLMs). To address this, we first build lightweight captioning models using a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluate their performance not only on single-sentence but on detailed captioning tasks. We obtain surprising results showing that our model can achieve performance comparable to MLLMs, suggesting its potential to serve as a strong captioning specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from occasional captioning errors. We investigate the underlying causes and observe that the problems stem from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality by refining coarse descriptions into more precise captions. At its core, DeepLens improves visual grounding by re-examining the informative regions identified in the initial glance. Experimental results demonstrate the superiority of our model over both recent lightweight captioning methods and MLLMs in detailed captioning and even in long-range video QA tasks.
comment: Project page: https://sites.google.com/view/junha/lightweightcaptioner
♻ ☆ Generating Multi-Image Synthetic Data for Text-to-Image Customization ICCV 2025
Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD, and with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
comment: ICCV 2025. Project webpage: https://www.cs.cmu.edu/~syncd-project/
♻ ☆ Precise Mobile Manipulation of Small Everyday Objects
Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM uses state-of-the-art vision foundation models to generate 3D targets for visual servoing to enable diverse tasks in novel environments. Naively doing so fails because of occlusion by the end-effector. SVM mitigates this using vision models that out-paint the end-effector, thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module for SVM to seek semantic targets (e.g. knobs) and point tracking methods can help SVM reliably pursue interaction sites indicated by user clicks. We conduct a large-scale evaluation spanning experiments in 10 novel environments across 6 buildings including 72 different object instances. SVM obtains a 71% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method by an absolute 42% and an imitation learning baseline trained on 1000+ demonstrations also by an absolute success rate of 50%.
comment: Project webpage: https://arjung128.github.io/svm
♻ ☆ Normalize Filters! Classical Wisdom for Deep Vision
Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
♻ ☆ SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space
The rapid growth of digital pathology and advances in self-supervised deep learning have enabled the development of foundational models for various pathology tasks across diverse diseases. While multimodal approaches integrating diverse data sources have emerged, a critical gap remains in the comprehensive integration of whole-slide images (WSIs) with spatial transcriptomics (ST), which is crucial for capturing critical molecular heterogeneity beyond standard hematoxylin & eosin (H&E) staining. We introduce SPADE, a foundation model that integrates histopathology with ST data to guide image representation learning within a unified framework, in effect creating an ST-informed latent space. SPADE leverages a mixture-of-data experts technique, where experts are created via two-stage imaging feature-space clustering using contrastive learning to learn representations of co-registered WSI patches and gene expression profiles. Pre-trained on the comprehensive HEST-1k dataset, SPADE is evaluated on 20 downstream tasks, demonstrating significantly superior few-shot performance compared to baseline models, highlighting the benefits of integrating morphological and molecular information into one latent space. Code and pretrained weights are available at https://github.com/uclabair/SPADE.
♻ ☆ LongLive: Real-time Interactive Long Video Generation
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
comment: Code, model, and demos are available at https://github.com/NVlabs/LongLive
♻ ☆ Generate, Transduct, Adapt: Iterative Transduction with VLMs
Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP, yields an average performance improvement of 8.6% and 3.7% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by the transductive learning. Code is released at https://github.com/cvl-umass/GTA-CLIP
comment: Code is released at https://github.com/cvl-umass/GTA-CLIP
♻ ☆ Semi-Unsupervised Microscopy Segmentation with Fuzzy Logic and Spatial Statistics for Cross-Domain Analysis Using a GUI
Brightfield microscopy of unstained live cells is challenging due to low contrast, dynamic morphology, uneven illumination, and lack of labels. Deep learning achieved SOTA performance on stained, high-contrast images but needs large labeled datasets, expensive hardware, and fails under uneven illumination. This study presents a low-cost, lightweight, annotation-free segmentation method by introducing one-time calibration-assisted unsupervised framework adaptable across imaging modalities and image type. The framework determines background via spatial standard deviation from the local mean. Uncertain pixels are resolved using fuzzy logic, cumulative squared shift of nodal intensity, statistical features, followed by post-segmentation denoising calibration which is saved as a profile for reuse until noise pattern or object type substantially change. The program runs as a script or graphical interface for non-programmers. The method was rigorously evaluated using \textit{IoU}, \textit{F1-score}, and other metrics, with statistical significance confirmed via Wilcoxon signed-rank tests. On unstained brightfield myoblast (C2C12) images, it outperformed \textit{Cellpose 3.0} and \textit{StarDist}, improving IoU by up to 48\% (average IoU = 0.43, F1 = 0.60). In phase-contrast microscopy, it achieved a mean IoU of 0.69 and an F1-score of 0.81 on the \textit{LIVECell} dataset ($n = 3178$), with substantial expert agreement ($\kappa > 0.75$) confirming cross-modality robustness. Successful segmentation of laser-affected polymer surfaces further confirmed cross-domain robustness. By introducing the \textit{Homogeneous Image Plane} concept, this work provides a new theoretical foundation for training-free, annotation-free segmentation. The framework operates efficiently on CPU, avoids cell staining, and is practical for live-cell imaging and biomedical applications.
♻ ☆ Algorithmic Implementation: An Introduction to a Low-Cost, GUI-Based, Semi-Unsupervised Microscopy Segmentation Framework
This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical application-particularly in cell transplantation for personalised medicine and muscle regeneration therapies. The access to the application is available for reproducibility.
♻ ☆ Exploring Facial Biomarkers for Depression through Temporal Analysis of Action Units
Depression is characterized by persistent sadness and loss of interest, significantly impairing daily functioning and now a widespread mental disorder. Traditional diagnostic methods rely on subjective assessments, necessitating objective approaches for accurate diagnosis. Our study investigates the use of facial action units (AUs) and emotions as biomarkers for depression. We analyzed facial expressions from video data of participants classified with or without depression. Our methodology involved detailed feature extraction, mean intensity comparisons of key AUs, and the application of time series classification models. Furthermore, we employed Principal Component Analysis (PCA) and various clustering algorithms to explore the variability in emotional expression patterns. Results indicate significant differences in the intensities of AUs associated with sadness and happiness between the groups, highlighting the potential of facial analysis in depression assessment.
comment: Updated Authors
♻ ☆ Visual Affordance Prediction: Survey and Reproducibility
Affordances are the potential actions an agent can perform on an object, as observed by a camera. Visual affordance prediction is formulated differently for tasks such as grasping detection, affordance classification, affordance segmentation, and hand pose estimation. This diversity in formulations leads to inconsistent definitions that prevent fair comparisons between methods. In this paper, we propose a unified formulation of visual affordance prediction by accounting for the complete information on the objects of interest and the interaction of the agent with the objects to accomplish a task. This unified formulation allows us to comprehensively and systematically review disparate visual affordance works, highlighting strengths and limitations of both methods and datasets. We also discuss reproducibility issues, such as the unavailability of methods implementation and experimental setups details, making benchmarks for visual affordance prediction unfair and unreliable. To favour transparency, we introduce the Affordance Sheet, a document that details the solution, datasets, and validation of a method, supporting future reproducibility and fairness in the community.
comment: 18 pages, 3 figures, 13 tables. Project website at https://apicis.github.io/aff-survey/
♻ ☆ ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization NeurIPS 2025
The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: For 3-bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for learning 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
comment: NeurIPS 2025. Model weights are available at https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95
♻ ☆ Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior
Diagnosing a whole-slide image is an interactive, multi-stage process of changing magnification and moving between fields. Although recent pathology foundation models demonstrated superior performances, practical agentic systems that decide what field to examine next, adjust magnification, and deliver explainable diagnoses are still lacking. Such limitation is largely bottlenecked by data: scalable, clinically aligned supervision of expert viewing behavior that is tacit and experience-based, not documented in textbooks or internet, and therefore absent from LLM training. Here we introduce a framework designed to address this challenge through three key breakthroughs. First, the AI Session Recorder seamlessly integrates with standard whole-slide image viewers to unobtrusively record routine navigation and convert the viewer logs into standardized behavioral commands and bounding boxes. Second, a lightweight human-in-the-loop review turns AI-drafted rationales for behavioral commands into the Pathology-CoT dataset, a form of paired "where to look" and "why it matters", enabling six-fold faster labeling compared to manual constructing such Chain-of-Thought dataset. Using this behavioral data, we build Pathology-o3, a two-stage agent that first proposes important ROIs and then performs behavior-guided reasoning. On the gastrointestinal lymph-node metastasis detection task, our method achieved 100 recall on the internal validation from Stanford Medicine and 97.6 recall on an independent external validation from Sweden, exceeding the state-of-the-art OpenAI o3 model and generalizing across backbones. To our knowledge, Pathology-CoT constitutes one of the first behavior-grounded agentic systems in pathology. Turning everyday viewer logs into scalable, expert-validated supervision, our framework makes agentic pathology practical and establishes a path to human-aligned, upgradeable clinical AI.
♻ ☆ Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow ICLR 2026
Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use contrastive objective that further strengthens alignment with conditioning signals and introduce synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting state-of-the-art in multi-modal human motion generation.
comment: Under review at ICLR 2026
♻ ☆ BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping
Benthic habitat mapping is fundamental for understanding marine ecosystems, guiding conservation efforts, and supporting sustainable resource management. Yet, the scarcity of large, annotated datasets limits the development and benchmarking of machine learning models in this domain. This paper introduces a thorough multi-modal dataset, comprising about a million side-scan sonar (SSS) tiles collected along the coast of Catalonia (Spain), complemented by bathymetric maps and a set of co-registered optical images from targeted surveys using an autonomous underwater vehicle (AUV). Approximately \num{36000} of the SSS tiles have been manually annotated with segmentation masks to enable supervised fine-tuning of classification models. All the raw sensor data, together with mosaics, are also released to support further exploration and algorithm development. To address challenges in multi-sensor data fusion for AUVs, we spatially associate optical images with corresponding SSS tiles, facilitating self-supervised, cross-modal representation learning. Accompanying open-source preprocessing and annotation tools are provided to enhance accessibility and encourage research. This resource aims to establish a standardized benchmark for underwater habitat mapping, promoting advancements in autonomous seafloor classification and multi-sensor integration.
comment: Article under review by IJRR
♻ ☆ Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection
Camouflaged objects that blend into natural scenes pose significant challenges for deep-learning models to detect and synthesize. While camouflaged object detection is a crucial task in computer vision with diverse real-world applications, this research topic has been constrained by limited data availability. We propose a framework for synthesizing camouflage data to enhance the detection of camouflaged objects in natural scenes. Our approach employs a generative model to produce realistic camouflage images, which can be used to train existing object detection models. Specifically, we use a camouflage environment generator supervised by a camouflage distribution classifier to synthesize the camouflage images, which are then fed into our generator to expand the dataset. Our framework outperforms the current state-of-the-art method on three datasets (COD10k, CAMO, and CHAMELEON), demonstrating its effectiveness in improving camouflaged object detection. This approach can serve as a plug-and-play data generation and augmentation module for existing camouflaged object detection tasks and provides a novel way to introduce more diversity and distributions into current camouflage datasets.
♻ ☆ Contrastive Local Manifold Learning for No-Reference Image Quality Assessment
Image Quality Assessment (IQA) methods typically overlook local manifold structures, leading to compromised discriminative capabilities in perceptual quality evaluation. To address this limitation, we present LML-IQA, an innovative no-reference IQA (NR-IQA) approach that leverages a combination of local manifold learning and contrastive learning. Our approach first extracts multiple patches from each image and identifies the most visually salient region. This salient patch serves as a positive sample for contrastive learning, while other patches from the same image are treated as intra-class negatives to preserve local distinctiveness. Patches from different images also act as inter-class negatives to enhance feature separation. Additionally, we introduce a mutual learning strategy to improve the model's ability to recognize and prioritize visually important regions. Comprehensive experiments across eight benchmark datasets demonstrate significant performance gains over state-of-the-art methods, achieving a PLCC of 0.942 on TID2013 (compared to 0.908) and 0.977 on CSIQ (compared to 0.965).
♻ ☆ Information Topology
We introduce \emph{Information Topology}: a framework that unifies information theory and algebraic topology by treating \emph{cycle closure} as the primitive operation of inference. The starting point is the \emph{dot-cycle dichotomy}, which separates pointwise, order-sensitive fluctuations (dots) from order-invariant, predictive structure (cycles). Algebraically, closure is the cancellation of boundaries ($\partial^2=0$), which converts transient histories into stable invariants. Building on this, we derive the \emph{Structure-Before-Specificity} (SbS) principle: stable information resides in nontrivial homology classes that persist under perturbations, while high-entropy contextual details act as scaffolds. The \emph{Context-Content Uncertainty Principle} (CCUP) quantifies this balance by decomposing uncertainty into contextual spread and content precision, showing why prediction requires invariance for generalization. Measure concentration onto residual invariant manifolds explains \emph{order invariance}: when mass collapses to a narrow tube around a closed cycle, reparameterizations of micro-steps leave predictive functionals unchanged. We then define \emph{homological capacity}, the topological dual of Shannon capacity, as the sustainable number of independent informational cycles supported by a system. This capacity links dynamical (KS) entropy to structural (homological) capacity and refines Euler characteristics from a ``net'' summary to a ``gross'' count of persistent invariants. Finally, we illustrate the theory across three domains where \emph{more is different}: \textbf{visual binding}, \textbf{working memory}, and \textbf{access consciousness}. Together, these results recast inference, learning, and communication as \emph{topological stabilization}: the formation, closure, and persistence of informational cycles that make prediction robust and scalable.
Machine Learning 299
☆ Are Large Reasoning Models Interruptible?
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.
comment: Project Page: https://dynamic-lm.github.io/
☆ Reinforced sequential Monte Carlo for amortised sampling
This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler that uses samples from SMC -- using the learnt sampler as a proposal -- as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution as well as training stability compared to both amortised and Monte Carlo methods.
comment: code: https://github.com/hyeok9855/gfn-smc-jax
☆ Adversarial Attacks Leverage Interference Between Features in Superposition
Fundamental questions remain about when and why adversarial examples arise in neural networks, with competing views characterising them either as artifacts of the irregularities in the decision landscape or as products of sensitivity to non-robust input features. In this paper, we instead argue that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, we show how superposition - where networks represent more features than they have dimensions - creates arrangements of latent representations that adversaries can exploit. We demonstrate that adversarial perturbations leverage interference between superposed features, making attack patterns predictable from feature arrangements. Our framework provides a mechanistic explanation for two known phenomena: adversarial attack transferability between models with similar training regimes and class-specific vulnerability patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. We then demonstrate that these findings persist in a ViT trained on CIFAR-10. These findings reveal adversarial vulnerability can be a byproduct of networks' representational compression, rather than flaws in the learning process or non-robust inputs.
☆ QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
comment: Code is available at https://github.com/NVlabs/QeRL
☆ Tight Regret Upper and Lower Bounds for Optimistic Hedge in Two-Player Zero-Sum Games
In two-player zero-sum games, the learning dynamic based on optimistic Hedge achieves one of the best-known regret upper bounds among strongly-uncoupled learning dynamics. With an appropriately chosen learning rate, the social and individual regrets can be bounded by $O(\log(mn))$ in terms of the numbers of actions $m$ and $n$ of the two players. This study investigates the optimality of the dependence on $m$ and $n$ in the regret of optimistic Hedge. To this end, we begin by refining existing regret analysis and show that, in the strongly-uncoupled setting where the opponent's number of actions is known, both the social and individual regret bounds can be improved to $O(\sqrt{\log m \log n})$. In this analysis, we express the regret upper bound as an optimization problem with respect to the learning rates and the coefficients of certain negative terms, enabling refined analysis of the leading constants. We then show that the existing social regret bound as well as these new social and individual regret upper bounds cannot be further improved for optimistic Hedge by providing algorithm-dependent individual regret lower bounds. Importantly, these social regret upper and lower bounds match exactly including the constant factor in the leading term. Finally, building on these results, we improve the last-iterate convergence rate and the dynamic regret of a learning dynamic based on optimistic Hedge, and complement these bounds with algorithm-dependent dynamic regret lower bounds that match the improved bounds.
comment: 29 pages, 2 figures
Diffusion Transformers with Representation Autoencoders
Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
comment: Technical Report; Project Page: https://rae-dit.github.io/
☆ Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates -- both for post-training, and in a novel inference-time scaling setting we introduce. For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on almost all tasks. For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration -- with the right notion of diversity -- is a practical path toward discovery of new behaviors beyond sharpening.
comment: Website and code: https://rep-exp.github.io
☆ Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose \emph{Boundary-Guided Policy Optimization} (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
☆ Chronologically Consistent Generative AI
We introduce a family of chronologically consistent, instruction-following large language models to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, ensuring strict temporal separation from any post-cutoff data. The resulting framework offers (i) a simple, conversational chat interface, (ii) fully open, fixed model weights that guarantee replicability, and (iii) a conservative lower bound on forecast accuracy, isolating the share of predictability that survives once training leakage is removed. Together, these features provide researchers with an easy-to-use generative AI tool useful for a wide range of prediction tasks that is free of lookahead bias.
☆ Accelerated stochastic first-order method for convex optimization under heavy-tailed noise
We study convex composite optimization problems, where the objective function is given by the sum of a prox-friendly function and a convex function whose subgradients are estimated under heavy-tailed noise. Existing work often employs gradient clipping or normalization techniques in stochastic first-order methods to address heavy-tailed noise. In this paper, we demonstrate that a vanilla stochastic algorithm -- without additional modifications such as clipping or normalization -- can achieve optimal complexity for these problems. In particular, we establish that an accelerated stochastic proximal subgradient method achieves a first-order oracle complexity that is universally optimal for smooth, weakly smooth, and nonsmooth convex optimization, as well as for stochastic convex optimization under heavy-tailed noise. Numerical experiments are further provided to validate our theoretical results.
☆ An Eulerian Perspective on Straight-Line Sampling
We study dynamic measure transport for generative modeling: specifically, flows induced by stochastic processes that bridge a specified source and target distribution. The conditional expectation of the process' velocity defines an ODE whose flow map achieves the desired transport. We ask \emph{which processes produce straight-line flows} -- i.e., flows whose pointwise acceleration vanishes and thus are exactly integrable with a first-order method? We provide a concise PDE characterization of straightness as a balance between conditional acceleration and the divergence of a weighted covariance (Reynolds) tensor. Using this lens, we fully characterize affine-in-time interpolants and show that straightness occurs exactly under deterministic endpoint couplings. We also derive necessary conditions that constrain flow geometry for general processes, offering broad guidance for designing transports that are easier to integrate.
☆ MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
With the advent of DeepSeek-R1, a new wave of reinforcement learning (RL) methods has emerged that seem to unlock stronger mathematical reasoning. However, a closer look at the open-source ecosystem reveals a critical limitation: with sufficiently many draws (e.g., $\texttt{pass@1024}$), many existing base models already solve nearly all questions on widely used math benchmarks such as MATH-500 and AIME 2024. This suggests that the RL fine-tuning methods prevalent in the LLM reasoning literature largely sharpen existing solution modes rather than discovering entirely new ones. Such sharpening stands in contrast to the broader promise of RL: to foster exploration and to acquire new skills. To move beyond this plateau, we introduce MATH-Beyond (MATH-B), a benchmark deliberately constructed to defeat common open-source models of up to 8B parameters even under large sampling budgets. Improving performance on our benchmark via RL requires methods that learn to reason in ways that go beyond base model capabilities in repeated sampling. Since the problems are drawn from subsets of DAPO-Math-17K and DeepScaleR datasets, they remain topically equivalent to standard high-school math. Validating our premise, RL fine-tuned models such as Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B-Preview perform poorly on MATH-B at $\texttt{pass@1024}$, showing how existing approaches fall short on tackling harder instances. We hope MATH-B will catalyze exploration-driven RL approaches that elicit deeper reasoning capabilities. We release MATH-B at https://huggingface.co/datasets/brendel-group/MATH-Beyond.
☆ Continual Release of Densest Subgraphs: Privacy Amplification & Sublinear Space via Subsampling
We study the sublinear space continual release model for edge-differentially private (DP) graph algorithms, with a focus on the densest subgraph problem (DSG) in the insertion-only setting. Our main result is the first continual release DSG algorithm that matches the additive error of the best static DP algorithms and the space complexity of the best non-private streaming algorithms, up to constants. The key idea is a refined use of subsampling that simultaneously achieves privacy amplification and sparsification, a connection not previously formalized in graph DP. Via a simple black-box reduction to the static setting, we obtain both pure and approximate-DP algorithms with $O(\log n)$ additive error and $O(n\log n)$ space, improving both accuracy and space complexity over the previous state of the art. Along the way, we introduce graph densification in the graph DP setting, adding edges to trigger earlier subsampling, which removes the extra logarithmic factors in error and space incurred by prior work [ELMZ25]. We believe this simple idea may be of independent interest.
comment: to be published in SOSA'26
☆ NV3D: Leveraging Spatial Shape Through Normal Vector-based 3D Object Detection
Recent studies in 3D object detection for autonomous vehicles aim to enrich features through the utilization of multi-modal setups or the extraction of local patterns within LiDAR point clouds. However, multi-modal methods face significant challenges in feature alignment, and gaining features locally can be oversimplified for complex 3D object detection tasks. In this paper, we propose a novel model, NV3D, which utilizes local features acquired from voxel neighbors, as normal vectors computed per voxel basis using K-nearest neighbors (KNN) and principal component analysis (PCA). This informative feature enables NV3D to determine the relationship between the surface and pertinent target entities, including cars, pedestrians, or cyclists. During the normal vector extraction process, NV3D offers two distinct sampling strategies: normal vector density-based sampling and FOV-aware bin-based sampling, allowing elimination of up to 55% of data while maintaining performance. In addition, we applied element-wise attention fusion, which accepts voxel features as the query and value and normal vector features as the key, similar to the attention mechanism. Our method is trained on the KITTI dataset and has demonstrated superior performance in car and cyclist detection owing to their spatial shapes. In the validation set, NV3D without sampling achieves 86.60% and 80.18% mean Average Precision (mAP), greater than the baseline Voxel R-CNN by 2.61% and 4.23% mAP, respectively. With both samplings, NV3D achieves 85.54% mAP in car detection, exceeding the baseline by 1.56% mAP, despite roughly 55% of voxels being filtered out.
☆ Lecture Notes on Verifying Graph Neural Networks
In these lecture notes, we first recall the connection between graph neural networks, Weisfeiler-Lehman tests and logics such as first-order logic and graded modal logic. We then present a modal logic in which counting modalities appear in linear inequalities in order to solve verification tasks on graph neural networks. We describe an algorithm for the satisfiability problem of that logic. It is inspired from the tableau method of vanilla modal logic, extended with reasoning in quantifier-free fragment Boolean algebra with Presburger arithmetic.
Attention Factors for Statistical Arbitrage
Statistical arbitrage exploits temporal price differences between similar assets. We develop a framework to jointly identify similar assets through factors, identify mispricing and form a trading policy that maximizes risk-adjusted performance after trading costs. Our Attention Factors are conditional latent factors that are the most useful for arbitrage trading. They are learned from firm characteristic embeddings that allow for complex interactions. We identify time-series signals from the residual portfolios of our factors with a general sequence model. Estimating factors and the arbitrage trading strategy jointly is crucial to maximize profitability after trading costs. In a comprehensive empirical study we show that our Attention Factor model achieves an out-of-sample Sharpe ratio above 4 on the largest U.S. equities over a 24-year period. Our one-step solution yields an unprecedented Sharpe ratio of 2.3 net of transaction costs. We show that weak factors are important for arbitrage trading.
comment: Accepted to the 6th ACM International Conference on AI in Finance
☆ Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
The success of Transformer language models is widely credited to their dot-product attention mechanism, which interweaves a set of key design principles: mixing information across positions (enabling multi-token interactions), sequence-dependent activations (where attention weights adapt to each input), a specific mathematical form (dot-product similarities plus softmax weighting), and coupling of queries and keys to evolving hidden states (grounding attention in the current layer). However, the necessity of each of these principles remains largely untested. In this work, we systematically deconstruct attention by designing controlled variants that selectively relax these principles, applied both uniformly across all layers and in hybrid architectures where only some layers retain standard attention. Our empirical analysis reveals that mechanisms for mixing tokens are indispensable, as their absence collapses models to near-random behavior, while the exact mathematical form and sequence dependency can be substantially relaxed, especially when preserved in just a subset of layers. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention, highlighting a cooperative effect. These findings deepen our understanding of what truly underpins attention's effectiveness and open new avenues for simplifying language models without sacrificing performance.
☆ SemCSE-Multi: Multifaceted and Decodable Embeddings for Aspect-Specific and Interpretable Scientific Domain Mapping
We propose SemCSE-Multi, a novel unsupervised framework for generating multifaceted embeddings of scientific abstracts, evaluated in the domains of invasion biology and medicine. These embeddings capture distinct, individually specifiable aspects in isolation, thus enabling fine-grained and controllable similarity assessments as well as adaptive, user-driven visualizations of scientific domains. Our approach relies on an unsupervised procedure that produces aspect-specific summarizing sentences and trains embedding models to map semantically related summaries to nearby positions in the embedding space. We then distill these aspect-specific embedding capabilities into a unified embedding model that directly predicts multiple aspect embeddings from a scientific abstract in a single, efficient forward pass. In addition, we introduce an embedding decoding pipeline that decodes embeddings back into natural language descriptions of their associated aspects. Notably, we show that this decoding remains effective even for unoccupied regions in low-dimensional visualizations, thus offering vastly improved interpretability in user-centric settings.
☆ Hierarchical Qubit-Merging Transformer for Quantum Error Correction
For reliable large-scale quantum computation, a quantum error correction (QEC) scheme must effectively resolve physical errors to protect logical information. Leveraging recent advances in deep learning, neural network-based decoders have emerged as a promising approach to enhance the reliability of QEC. We propose the Hierarchical Qubit-Merging Transformer (HQMT), a novel and general decoding framework that explicitly leverages the structural graph of stabilizer codes to learn error correlations across multiple scales. Our architecture first computes attention locally on structurally related groups of stabilizers and then systematically merges these qubit-centric representations to build a global view of the error syndrome. The proposed HQMT achieves substantially lower logical error rates for surface codes by integrating a dedicated qubit-merging layer within the transformer architecture. Across various code distances, HQMT significantly outperforms previous neural network-based QEC decoders as well as a powerful belief propagation with ordered statistics decoding (BP+OSD) baseline. This hierarchical approach provides a scalable and effective framework for surface code decoding, advancing the realization of reliable quantum computing.
comment: 6 pages, 5 figures
Diffusion-DFL: Decision-focused Diffusion Models for Stochastic Optimization
Decision-focused learning (DFL) integrates predictive modeling and optimization by training predictors to optimize the downstream decision target rather than merely minimizing prediction error. To date, existing DFL methods typically rely on deterministic point predictions, which are often insufficient to capture the intrinsic stochasticity of real-world environments. To address this challenge, we propose the first diffusion-based DFL approach, which trains a diffusion model to represent the distribution of uncertain parameters and optimizes the decision by solving a stochastic optimization with samples drawn from the diffusion model. Our contributions are twofold. First, we formulate diffusion DFL using the reparameterization trick, enabling end-to-end training through diffusion. While effective, it is memory and compute-intensive due to the need to differentiate through the diffusion sampling process. Second, we propose a lightweight score function estimator that uses only several forward diffusion passes and avoids backpropagation through the sampling. This follows from our results that backpropagating through stochastic optimization can be approximated by a weighted score function formulation. We empirically show that our diffusion DFL approach consistently outperforms strong baselines in decision quality. The source code for all experiments is available at the project repository: https://github.com/GT-KOALA/Diffusion_DFL.
☆ MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions. (2) a Sentiment Intensity Guided (SIG) module using multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities. (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities, and incorporates the Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.
comment: Under Review
☆ A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation
Synthetic datasets are widely used for training urban scene recognition models, but even highly realistic renderings show a noticeable gap to real imagery. This gap is particularly pronounced when adapting to a specific target domain, such as Cityscapes, where differences in architecture, vegetation, object appearance, and camera characteristics limit downstream performance. Closing this gap with more detailed 3D modelling would require expensive asset and scene design, defeating the purpose of low-cost labelled data. To address this, we present a new framework that adapts an off-the-shelf diffusion model to a target domain using only imperfect pseudo-labels. Once trained, it generates high-fidelity, target-aligned images from semantic maps of any synthetic dataset, including low-effort sources created in hours rather than months. The method filters suboptimal generations, rectifies image-label misalignments, and standardises semantics across datasets, transforming weak synthetic data into competitive real-domain training sets. Experiments on five synthetic datasets and two real target datasets show segmentation gains of up to +8.0%pt. mIoU over state-of-the-art translation methods, making rapidly constructed synthetic datasets as effective as high-effort, time-intensive synthetic datasets requiring extensive manual design. This work highlights a valuable collaborative paradigm where fast semantic prototyping, combined with generative models, enables scalable, high-quality training data creation for urban scene understanding.
☆ Ontolearn-A Framework for Large-scale OWL Class Expression Learning in Python
In this paper, we present Ontolearn-a framework for learning OWL class expressions over large knowledge graphs. Ontolearn contains efficient implementations of recent stateof-the-art symbolic and neuro-symbolic class expression learners including EvoLearner and DRILL. A learned OWL class expression can be used to classify instances in the knowledge graph. Furthermore, Ontolearn integrates a verbalization module based on an LLM to translate complex OWL class expressions into natural language sentences. By mapping OWL class expressions into respective SPARQL queries, Ontolearn can be easily used to operate over a remote triplestore. The source code of Ontolearn is available at https://github.com/dice-group/Ontolearn.
Efficient Group Lasso Regularized Rank Regression with Data-Driven Parameter Determination
High-dimensional regression often suffers from heavy-tailed noise and outliers, which can severely undermine the reliability of least-squares based methods. To improve robustness, we adopt a non-smooth Wilcoxon score based rank objective and incorporate structured group sparsity regularization, a natural generalization of the lasso, yielding a group lasso regularized rank regression method. By extending the tuning-free parameter selection scheme originally developed for the lasso, we introduce a data-driven, simulation-based tuning rule and further establish a finite-sample error bound for the resulting estimator. On the computational side, we develop a proximal augmented Lagrangian method for solving the associated optimization problem, which eliminates the singularity issues encountered in existing methods, thereby enabling efficient semismooth Newton updates for the subproblems. Extensive numerical experiments demonstrate the robustness and effectiveness of our proposed estimator against alternatives, and showcase the scalability of the algorithm across both simulated and real-data settings.
comment: 36 pages, 4 figures, 8 tables
☆ Query-Specific GNN: A Comprehensive Graph Representation Learning Method for Retrieval Augmented Generation
Retrieval-augmented generation (RAG) has demonstrated its ability to enhance Large Language Models (LLMs) by integrating external knowledge sources. However, multi-hop questions, which require the identification of multiple knowledge targets to form a synthesized answer, raise new challenges for RAG systems. Under the multi-hop settings, existing methods often struggle to fully understand the questions with complex semantic structures and are susceptible to irrelevant noise during the retrieval of multiple information targets. To address these limitations, we propose a novel graph representation learning framework for multi-hop question retrieval. We first introduce a Multi-information Level Knowledge Graph (Multi-L KG) to model various information levels for a more comprehensive understanding of multi-hop questions. Based on this, we design a Query-Specific Graph Neural Network (QSGNN) for representation learning on the Multi-L KG. QSGNN employs intra/inter-level message passing mechanisms, and in each message passing the information aggregation is guided by the query, which not only facilitates multi-granular information aggregation but also significantly reduces the impact of noise. To enhance its ability to learn robust representations, we further propose two synthesized data generation strategies for pre-training the QSGNN. Extensive experimental results demonstrate the effectiveness of our framework in multi-hop scenarios, especially in high-hop questions the improvement can reach 33.8\%. The code is available at: https://github.com/Jerry2398/QSGNN.
☆ Automatic Music Sample Identification with Multi-Track Contrastive Learning
Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is, detecting such sampled content and retrieving the material from which it originates. To do so, we adopt a self-supervised learning approach that leverages a multi-track dataset to create positive pairs of artificial mixes, and design a novel contrastive learning objective. We show that such method significantly outperforms previous state-of-the-art baselines, that is robust to various genres, and that scales well when increasing the number of noise songs in the reference database. In addition, we extensively analyze the contribution of the different components of our training pipeline and highlight, in particular, the need for high-quality separated stems for this task.
☆ Knowledge-Guided Machine Learning Models to Upscale Evapotranspiration in the U.S. Midwest
Evapotranspiration (ET) plays a critical role in the land-atmosphere interactions, yet its accurate quantification across various spatiotemporal scales remains a challenge. In situ measurement approaches, like eddy covariance (EC) or weather station-based ET estimation, allow for measuring ET at a single location. Agricultural uses of ET require estimates for each field over broad areas, making it infeasible to deploy sensing systems at each location. This study integrates tree-based and knowledge-guided machine learning (ML) techniques with multispectral remote sensing data, griddled meteorology and EC data to upscale ET across the Midwest United States. We compare four tree-based models - Random Forest, CatBoost, XGBoost, LightGBM - and a simple feed-forward artificial neural network in combination with features engineered using knowledge-guided ML principles. Models were trained and tested on EC towers located in the Midwest of the United States using k-fold cross validation with k=5 and site-year, biome stratified train-test split to avoid data leakage. Results show that LightGBM with knowledge-guided features outperformed other methods with an R2=0.86, MSE=14.99 W m^-2 and MAE = 8.82 W m^-2 according to grouped k-fold validation (k=5). Feature importance analysis shows that knowledge-guided features were most important for predicting evapotranspiration. Using the best performing model, we provide a data product at 500 m spatial and one-day temporal resolution for gridded ET for the period of 2019-2024. Intercomparison between the new gridded product and state-level weather station-based ET estimates show best-in-class correspondence.
☆ Learning to Make MISTAKEs: Modeling Incorrect Student Thinking And Key Errors
Research on reasoning in language models (LMs) predominantly focuses on improving the correctness of their outputs. But some important applications require modeling reasoning patterns that are incorrect. For example, automated systems that can reason about and simulate student errors are useful for providing real-time feedback in the classroom or offline practice for educators-in-training. This paper presents a new method, MISTAKE, that (1) constructs high-quality synthetic examples of reasoning errors by leveraging cycle consistency between incorrect answers and latent misconceptions; and (2) uses the generated data to learn models for student simulation, misconception classification, and answer generation. We evaluate MISTAKE on three educational tasks and find that it results in (1) higher accuracy when simulating incorrect student answers based on specific misconceptions, (2) increased performance inferring latent misconceptions from observed incorrect answers, and (3) higher alignment with expert-written distractor answers when generating incorrect answers (e.g., for multiple-choice tests).
☆ Context-Aware Model-Based Reinforcement Learning for Autonomous Racing
Autonomous vehicles have shown promising potential to be a groundbreaking technology for improving the safety of road users. For these vehicles, as well as many other safety-critical robotic technologies, to be deployed in real-world applications, we require algorithms that can generalize well to unseen scenarios and data. Model-based reinforcement learning algorithms (MBRL) have demonstrated state-of-the-art performance and data efficiency across a diverse set of domains. However, these algorithms have also shown susceptibility to changes in the environment and its transition dynamics. In this work, we explore the performance and generalization capabilities of MBRL algorithms for autonomous driving, specifically in the simulated autonomous racing environment, Roboracer (formerly F1Tenth). We frame the head-to-head racing task as a learning problem using contextual Markov decision processes and parameterize the driving behavior of the adversaries using the context of the episode, thereby also parameterizing the transition and reward dynamics. We benchmark the behavior of MBRL algorithms in this environment and propose a novel context-aware extension of the existing literature, cMask. We demonstrate that context-aware MBRL algorithms generalize better to out-of-distribution adversary behaviors relative to context-free approaches. We also demonstrate that cMask displays strong generalization capabilities, as well as further performance improvement relative to other context-aware MBRL approaches when racing against adversaries with in-distribution behaviors.
comment: Accepted to IEEE ICAR 2025
☆ Offline Reinforcement Learning with Generative Trajectory Policies ICLR 2026
Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks - it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.
comment: Preprint. Under review at ICLR 2026
☆ ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate--diagnose--refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic--scoring code with screenshots--and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
☆ How Reinforcement Learning After Next-Token Prediction Facilitates Learning
Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next-token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long ``chain-of-thought'' sequences encoding a single task. In particular, when the task consists of predicting the parity of $d$ bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of $d$ bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension $d$. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.
☆ Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to 126 times while increasing returns by over an order of magnitude compared to prior methods.
☆ Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware
Integer AI inference significantly reduces computational complexity in embedded systems. Quantization-aware training (QAT) helps mitigate accuracy degradation associated with post-training quantization but still overlooks the impact of integer rescaling during inference, which is a hardware costly operation in integer-only AI inference. This work shows that rescaling cost can be dramatically reduced post-training, by applying a stronger quantization to the rescale multiplicands at no model-quality loss. Furthermore, we introduce Rescale-Aware Training, a fine tuning method for ultra-low bit-width rescaling multiplicands. Experiments show that even with 8x reduced rescaler widths, the full accuracy is preserved through minimal incremental retraining. This enables more energy-efficient and cost-efficient AI inference for resource-constrained embedded systems.
comment: Submitted to IEEE Embedded Systems Letters
☆ Coordinated Strategies in Realistic Air Combat by Hierarchical Multi-Agent Reinforcement Learning
Achieving mission objectives in a realistic simulation of aerial combat is highly challenging due to imperfect situational awareness and nonlinear flight dynamics. In this work, we introduce a novel 3D multi-agent air combat environment and a Hierarchical Multi-Agent Reinforcement Learning framework to tackle these challenges. Our approach combines heterogeneous agent dynamics, curriculum learning, league-play, and a newly adapted training algorithm. To this end, the decision-making process is organized into two abstraction levels: low-level policies learn precise control maneuvers, while high-level policies issue tactical commands based on mission objectives. Empirical results show that our hierarchical approach improves both learning efficiency and combat performance in complex dogfight scenarios.
comment: 2025 IEEE International Conference on Agentic AI (ICA)
☆ Differentiable Fast Top-K Selection for Large-Scale Recommendation
Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing the use of soft permutation matrices. However, even state-of-the-art differentiable Top-K operator (e.g., LapSum) require $O(n \log n)$ complexity due to their dependence on sorting for solving the threshold. Thus, we propose DFTopK, a novel differentiable Top-K operator achieving optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFLow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77\% revenue lift with the same computational budget compared to the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.
comment: 12 pages, 5 figures
☆ Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers
Modern learning systems increasingly rely on amortized learning - the idea of reusing computation or inductive biases shared across tasks to enable rapid generalization to novel problems. This principle spans a range of approaches, including meta-learning, in-context learning, prompt tuning, learned optimizers and more. While motivated by similar goals, these approaches differ in how they encode and leverage task-specific information, often provided as in-context examples. In this work, we propose a unified framework which describes how such methods differ primarily in the aspects of learning they amortize - such as initializations, learned updates, or predictive mappings - and how they incorporate task data at inference. We introduce a taxonomy that categorizes amortized models into parametric, implicit, and explicit regimes, based on whether task adaptation is externalized, internalized, or jointly modeled. Building on this view, we identify a key limitation in current approaches: most methods struggle to scale to large datasets because their capacity to process task data at inference (e.g., context length) is often limited. To address this, we propose iterative amortized inference, a class of models that refine solutions step-by-step over mini-batches, drawing inspiration from stochastic optimization. Our formulation bridges optimization-based meta-learning with forward-pass amortization in models like LLMs, offering a scalable and extensible foundation for general-purpose task adaptation.
☆ Reconstructing 12-Lead ECG from 3-Lead ECG using Variational Autoencoder to Improve Cardiac Disease Detection of Wearable ECG Devices
Twelve-lead electrocardiograms (ECGs) are the clinical gold standard for cardiac diagnosis, providing comprehensive spatial coverage of the heart necessary to detect conditions such as myocardial infarction (MI). However, their lack of portability limits continuous and large-scale use. Three-lead ECG systems are widely used in wearable devices due to their simplicity and mobility, but they often fail to capture pathologies in unmeasured regions. To address this, we propose WearECG, a Variational Autoencoder (VAE) method that reconstructs twelve-lead ECGs from three leads: II, V1, and V5. Our model includes architectural improvements to better capture temporal and spatial dependencies in ECG signals. We evaluate generation quality using MSE, MAE, and Frechet Inception Distance (FID), and assess clinical validity via a Turing test with expert cardiologists. To further validate diagnostic utility, we fine-tune ECGFounder, a large-scale pretrained ECG model, on a multi-label classification task involving over 40 cardiac conditions, including six different myocardial infarction locations, using both real and generated signals. Experiments on the MIMIC dataset show that our method produces physiologically realistic and diagnostically informative signals, with robust performance in downstream tasks. This work demonstrates the potential of generative modeling for ECG reconstruction and its implications for scalable, low-cost cardiac screening.
comment: 24 pages, 5 figures, submitted to Nature Communications
☆ Forward-Forward Autoencoder Architectures for Energy-Efficient Wireless Communications
The application of deep learning to the area of communications systems has been a growing field of interest in recent years. Forward-forward (FF) learning is an efficient alternative to the backpropagation (BP) algorithm, which is the typically used training procedure for neural networks. Among its several advantages, FF learning does not require the communication channel to be differentiable and does not rely on the global availability of partial derivatives, allowing for an energy-efficient implementation. In this work, we design end-to-end learned autoencoders using the FF algorithm and numerically evaluate their performance for the additive white Gaussian noise and Rayleigh block fading channels. We demonstrate their competitiveness with BP-trained systems in the case of joint coding and modulation, and in a scenario where a fixed, non-differentiable modulation stage is applied. Moreover, we provide further insights into the design principles of the FF network, its training convergence behavior, and significant memory and processing time savings compared to BP-based approaches.
☆ Leveraging LLMs for Semi-Automatic Corpus Filtration in Systematic Literature Reviews
The creation of systematic literature reviews (SLR) is critical for analyzing the landscape of a research field and guiding future research directions. However, retrieving and filtering the literature corpus for an SLR is highly time-consuming and requires extensive manual effort, as keyword-based searches in digital libraries often return numerous irrelevant publications. In this work, we propose a pipeline leveraging multiple large language models (LLMs), classifying papers based on descriptive prompts and deciding jointly using a consensus scheme. The entire process is human-supervised and interactively controlled via our open-source visual analytics web interface, LLMSurver, which enables real-time inspection and modification of model outputs. We evaluate our approach using ground-truth data from a recent SLR comprising over 8,000 candidate papers, benchmarking both open and commercial state-of-the-art LLMs from mid-2024 and fall 2025. Results demonstrate that our pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Furthermore, modern open-source models prove sufficient for this task, making the method accessible and cost-effective. Overall, our work demonstrates how responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.
☆ FedHybrid: Breaking the Memory Wall of Federated Learning via Hybrid Tensor Management
Federated Learning (FL) emerges as a new learning paradigm that enables multiple devices to collaboratively train a shared model while preserving data privacy. However, one fundamental and prevailing challenge that hinders the deployment of FL on mobile devices is the memory limitation. This paper proposes \textit{FedHybrid}, a novel framework that effectively reduces the memory footprint during the training process while guaranteeing the model accuracy and the overall training progress. Specifically, \textit{FedHybrid} first selects the participating devices for each training round by jointly evaluating their memory budget, computing capability, and data diversity. After that, it judiciously analyzes the computational graph and generates an execution plan for each selected client in order to meet the corresponding memory budget while minimizing the training delay through employing a hybrid of recomputation and compression techniques according to the characteristic of each tensor. During the local training process, \textit{FedHybrid} carries out the execution plan with a well-designed activation compression technique to effectively achieve memory reduction with minimum accuracy loss. We conduct extensive experiments to evaluate \textit{FedHybrid} on both simulation and off-the-shelf mobile devices. The experiment results demonstrate that \textit{FedHybrid} achieves up to a 39.1\% increase in model accuracy and a 15.5$\times$ reduction in wall clock time under various memory budgets compared with the baselines.
comment: Sensys 2024
☆ Medical Interpretability and Knowledge Maps of Large Language Models
We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how the LLMs both represent and process medical knowledge through four different interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal and (4) activation patching. We present knowledge maps of five LLMs which show, at a coarse-resolution, where knowledge about patient's ages, medical symptoms, diseases and drugs is stored in the models. In particular for Llama3.3-70B, we find that most medical knowledge is processed in the first half of the model's layers. In addition, we find several interesting phenomena: (i) age is often encoded in a non-linear and sometimes discontinuous manner at intermediate layers in the models, (ii) the disease progression representation is non-monotonic and circular at certain layers of the model, (iii) in Llama3.3-70B, drugs cluster better by medical specialty rather than mechanism of action, especially for Llama3.3-70B and (iv) Gemma3-27B and MedGemma-27B have activations that collapse at intermediate layers but recover by the final layers. These results can guide future research on fine-tuning, un-learning or de-biasing LLMs for medical tasks by suggesting at which layers in the model these techniques should be applied.
comment: 29 pages, 34 figures, 5 tables
☆ Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
☆ Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks NeurIPS 2025
Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both Adam and AdamW with proper weight decay $\lambda$ converge to poor test error solutions, their mini-batch variants can achieve near-zero test error. We further prove Adam has a strictly smaller effective weight decay bound than AdamW, theoretically explaining why Adam requires more sensitive $\lambda$ tuning. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance.
comment: 71 pages, 12 figures, NeurIPS 2025
☆ Multi-View Graph Feature Propagation for Privacy Preservation and Feature Sparsity
Graph Neural Networks (GNNs) have demonstrated remarkable success in node classification tasks over relational data, yet their effectiveness often depends on the availability of complete node features. In many real-world scenarios, however, feature matrices are highly sparse or contain sensitive information, leading to degraded performance and increased privacy risks. Furthermore, direct exposure of information can result in unintended data leakage, enabling adversaries to infer sensitive information. To address these challenges, we propose a novel Multi-view Feature Propagation (MFP) framework that enhances node classification under feature sparsity while promoting privacy preservation. MFP extends traditional Feature Propagation (FP) by dividing the available features into multiple Gaussian-noised views, each propagating information independently through the graph topology. The aggregated representations yield expressive and robust node embeddings. This framework is novel in two respects: it introduces a mechanism that improves robustness under extreme sparsity, and it provides a principled way to balance utility with privacy. Extensive experiments conducted on graph datasets demonstrate that MFP outperforms state-of-the-art baselines in node classification while substantially reducing privacy leakage. Moreover, our analysis demonstrates that propagated outputs serve as alternative imputations rather than reconstructions of the original features, preserving utility without compromising privacy. A comprehensive sensitivity analysis further confirms the stability and practical applicability of MFP across diverse scenarios. Overall, MFP provides an effective and privacy-aware framework for graph learning in domains characterized by missing or sensitive features.
☆ Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony
Synchronous Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, many systems designed to accelerate RL post-training still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can achieve performance on par with synchronous training.
☆ Event-Aware Prompt Learning for Dynamic Graphs
Real-world graph typically evolve via a series of events, modeling dynamic interactions between objects across various domains. For dynamic graph learning, dynamic graph neural networks (DGNNs) have emerged as popular solutions. Recently, prompt learning methods have been explored on dynamic graphs. However, existing methods generally focus on capturing the relationship between nodes and time, while overlooking the impact of historical events. In this paper, we propose EVP, an event-aware dynamic graph prompt learning framework that can serve as a plug-in to existing methods, enhancing their ability to leverage historical events knowledge. First, we extract a series of historical events for each node and introduce an event adaptation mechanism to align the fine-grained characteristics of these events with downstream tasks. Second, we propose an event aggregation mechanism to effectively integrate historical knowledge into node representations. Finally, we conduct extensive experiments on four public datasets to evaluate and analyze EVP.
comment: Under review
☆ DiffStyleTS: Diffusion Model for Style Transfer in Time Series
Style transfer combines the content of one signal with the style of another. It supports applications such as data augmentation and scenario simulation, helping machine learning models generalize in data-scarce domains. While well developed in vision and language, style transfer methods for time series data remain limited. We introduce DiffTSST, a diffusion-based framework that disentangles a time series into content and style representations via convolutional encoders and recombines them through a self-supervised attention-based diffusion process. At inference, encoders extract content and style from two distinct series, enabling conditional generation of novel samples to achieve style transfer. We demonstrate both qualitatively and quantitatively that DiffTSST achieves effective style transfer. We further validate its real-world utility by showing that data augmentation with DiffTSST improves anomaly detection in data-scarce regimes.
Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance https://github.com/DevKiHyun/Diffusion-Link
comment: 5 pages. Submitted to IEEE ICASSP 2026
☆ When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Object detection systems have traditionally relied on supervised learning with manually annotated bounding boxes, achieving high accuracy at the cost of substantial annotation investment. The emergence of Vision-Language Models (VLMs) offers an alternative paradigm enabling zero-shot detection through natural language queries, eliminating annotation requirements but operating with reduced accuracy. This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection (YOLO) with zero-shot VLM inference (Gemini Flash 2.5). Through systematic evaluation on 1,000 stratified COCO images and 200 diverse product images spanning consumer electronics and rare categories, combined with detailed Total Cost of Ownership modeling, we establish quantitative break-even thresholds governing architecture selection. Our findings reveal that supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories, representing a 22.7 percentage point advantage that costs $10,800 in annotation for 100-category systems. However, this advantage justifies investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year. Zero-shot Gemini demonstrates 52.3% accuracy on diverse product categories (ranging from highly web-prevalent consumer electronics at 75-85% to rare specialized equipment at 25-40%) where supervised YOLO achieves 0% due to architectural constraints preventing detection of untrained classes. Cost per Correct Detection analysis reveals substantially lower per-detection costs for Gemini ($0.00050 vs $0.143) at 100,000 inferences despite accuracy deficits. We develop decision frameworks demonstrating that optimal architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics.
comment: 23 pages, 4 figures, 4 tables
☆ $Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization
Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named {\Delta}Energy. {\Delta}Energy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, {\Delta}Energy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for {\Delta}Energy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yields a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
comment: Accepted by NeruIPS2025
☆ LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
While Key-Value (KV) cache succeeds in reducing redundant computations in auto-regressive models, it introduces significant memory overhead, limiting its practical deployment in long-sequence scenarios. Existing KV retrieval methods mitigate this by dynamically retaining only a subset of KV entries on the GPU. However, they still suffer from notable efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level KV management, especially in long-output reasoning scenarios. With the emergence of large reasoning models, efficiently handling such scenarios has become increasingly important. To address this issue, we present two key observations: (1) critical KVs exhibit strong temporal locality during decoding, and (2) these KVs exhibit distinct distribution patterns across the input prompt and generated output. Building on these observations, we propose LouisKV, an efficient KV cache retrieval framework designed for various long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy leveraging temporal locality to trigger retrieval only at semantic boundaries, drastically reducing computation and data transfer overhead. LouisKV also designs a decoupled, fine-grained management scheme that tailors differentiated strategies for input and output sequences to create retrieval units that better match the model's attention patterns, enabling precise identification of critical KVs. Furthermore, to boost efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels to accelerate the KV clustering and retrieval. Evaluations show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.
☆ Network-Optimised Spiking Neural Network (NOS) Scheduling for 6G O-RAN: Spectral Margin and Delay-Tail Control
This work presents a Network-Optimised Spiking (NOS) delay-aware scheduler for 6G radio access. The scheme couples a bounded two-state kernel to a clique-feasible proportional-fair (PF) grant head: the excitability state acts as a finite-buffer proxy, the recovery state suppresses repeated grants, and neighbour pressure is injected along the interference graph via delayed spikes. A small-signal analysis yields a delay-dependent threshold $k_\star(\Delta)$ and a spectral margin $\delta = k_\star(\Delta) - gH\rho(W)$ that compress topology, controller gain, and delay into a single design parameter. Under light assumptions on arrivals, we prove geometric ergodicity for $\delta>0$ and derive sub-Gaussian backlog and delay tail bounds with exponents proportional to $\delta$. A numerical study, aligned with the analysis and a DU compute budget, compares NOS with PF and delayed backpressure (BP) across interference topologies over a $5$--$20$\,ms delay sweep. With a single gain fixed at the worst spectral radius, NOS sustains higher utilisation and a smaller 99.9th-percentile delay while remaining clique-feasible on integer PRBs.
comment: 6 pages, 5 figures, 1 table
☆ Gym-TORAX: Open-source software for integrating RL with plasma control simulators
This paper presents Gym-TORAX, a Python package enabling the implementation of Reinforcement Learning (RL) environments for simulating plasma dynamics and control in tokamaks. Users define succinctly a set of control actions and observations, and a control objective from which Gym-TORAX creates a Gymnasium environment that wraps TORAX for simulating the plasma dynamics. The objective is formulated through rewards depending on the simulated state of the plasma and control action to optimize specific characteristics of the plasma, such as performance and stability. The resulting environment instance is then compatible with a wide range of RL algorithms and libraries and will facilitate RL research in plasma control. In its current version, one environment is readily available, based on a ramp-up scenario of the International Thermonuclear Experimental Reactor (ITER).
☆ Vision-LLMs for Spatiotemporal Traffic Forecasting
Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While Large Language Models (LLMs) have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending LLMs to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model's context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of LLMs in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with Supervised Fine-Tuning (SFT) and then further optimized for predictive accuracy using Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain few-shot scenarios. Our extensive experiments validate the model's strong generalization capabilities across various data-scarce environments.
☆ ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models
We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation's policies/principles as directions to move on a model's information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce infoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model's CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single informationgeometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability
comment: 52 pages, 10 figures
☆ SeFEF: A Seizure Forecasting Evaluation Framework
The lack of standardization in seizure forecasting slows progress in the field and limits the clinical translation of forecasting models. In this work, we introduce a Python-based framework aimed at streamlining the development, assessment, and documentation of individualized seizure forecasting algorithms. The framework automates data labeling, cross-validation splitting, forecast post-processing, performance evaluation, and reporting. It supports various forecasting horizons and includes a model card that documents implementation details, training and evaluation settings, and performance metrics. Three different models were implemented as a proof-of-concept. The models leveraged features extracted from time series data and seizure periodicity. Model performance was assessed using time series cross-validation and key deterministic and probabilistic metrics. Implementation of the three models was successful, demonstrating the flexibility of the framework. The results also emphasize the importance of careful model interpretation due to variations in probability scaling, calibration, and subject-specific differences. Although formal usability metrics were not recorded, empirical observations suggest reduced development time and methodological consistency, minimizing unintentional variations that could affect the comparability of different approaches. As a proof-of-concept, this validation is inherently limited, relying on a single-user experiment without statistical analyses or replication across independent datasets. At this stage, our objective is to make the framework publicly available to foster community engagement, facilitate experimentation, and gather feedback. In the long term, we aim to contribute to the establishment of a consensus on a standardized methodology for the development and validation of seizure forecasting algorithms in people with epilepsy.
comment: main document: 14 pages, 9 figures, 2 tables; appendix: 7 pages, 2 figures, 3 tables, 2 algorithms
☆ FedLoRA-Optimizer: Federated LoRA Fine-Tuning with Global and Local Optimization in Heterogeneous Data Scenarios
Federated efficient fine-tuning has emerged as an approach that leverages distributed data and computational resources across nodes to address the challenges of large-scale fine-tuning and privacy preservation. The Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large-scale pre-trained models by introducing trainable low-rank matrices into weight updates.However, in heterogeneous data scenarios, client drift weakens the generalization of the global model, and local models often fail to meet the personalized needs of individual clients.Moreover, existing federated LoRA efficient fine-tuning techniques overlook fine-grained analysis of the tuning matrices. To address this, we conducted preliminary experiments and found that different LoRA matrices exhibit different sensitivity to changes in the direction and magnitude of their vectors.We thus propose a fine-grained federated LoRA tuning method. By fine-tuning the more sensitive directional vectors in the A matrix, which encode shared knowledge, our method learns shared features more effectively across clients and enhances global generalization. Simultaneously, by fine-tuning the more sensitive magnitude vectors in the B matrix, which encode personalized knowledge, our method better captures personalized knowledge, enabling detailed adaptation to local data. The method uses a pipeline combining global and local optimizers. Global optimization further improves local models, achieving collaborative optimization between global and local levels. This improves both the generalization ability of the global model and the personalized adaptation of local models under heterogeneous data scenarios. Experiments on Databricks-Dolly-15k and Natural Instructions with LLaMA2-7B and Deepseek-7B confirm that our method improves global performance by 0.39% and local performance by 0.59%.
☆ DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation
Loco-manipulation is a fundamental challenge for humanoid robots to achieve versatile interactions in human environments. Although recent studies have made significant progress in humanoid whole-body control, loco-manipulation remains underexplored and often relies on hard-coded task definitions or costly real-world data collection, which limits autonomy and generalization. We present DemoHLM, a framework for humanoid loco-manipulation that enables generalizable loco-manipulation on a real humanoid robot from a single demonstration in simulation. DemoHLM adopts a hierarchy that integrates a low-level universal whole-body controller with high-level manipulation policies for multiple tasks. The whole-body controller maps whole-body motion commands to joint torques and provides omnidirectional mobility for the humanoid robot. The manipulation policies, learned in simulation via our data generation and imitation learning pipeline, command the whole-body controller with closed-loop visual feedback to execute challenging loco-manipulation tasks. Experiments show a positive correlation between the amount of synthetic data and policy performance, underscoring the effectiveness of our data generation pipeline and the data efficiency of our approach. Real-world experiments on a Unitree G1 robot equipped with an RGB-D camera validate the sim-to-real transferability of DemoHLM, demonstrating robust performance under spatial variations across ten loco-manipulation tasks.
☆ MIEO: encoding clinical data to enhance cardiovascular event prediction
As clinical data are becoming increasingly available, machine learning methods have been employed to extract knowledge from them and predict clinical events. While promising, approaches suffer from at least two main issues: low availability of labelled data and data heterogeneity leading to missing values. This work proposes the use of self-supervised auto-encoders to efficiently address these challenges. We apply our methodology to a clinical dataset from patients with ischaemic heart disease. Patient data is embedded in a latent space, built using unlabelled data, which is then used to train a neural network classifier to predict cardiovascular death. Results show improved balanced accuracy compared to applying the classifier directly to the raw data, demonstrating that this solution is promising, especially in conditions where availability of unlabelled data could increase.
comment: Presented in the Poster Session of Computational Intelligence methods for Bioinformatics and Biostatistics (CIBB) 2025
☆ Large Language Models Are Effective Code Watermarkers
The widespread use of large language models (LLMs) and open-source code has raised ethical and security concerns regarding the distribution and attribution of source code, including unauthorized redistribution, license violations, and misuse of code for malicious purposes. Watermarking has emerged as a promising solution for source attribution, but existing techniques rely heavily on hand-crafted transformation rules, abstract syntax tree (AST) manipulation, or task-specific training, limiting their scalability and generality across languages. Moreover, their robustness against attacks remains limited. To address these limitations, we propose CodeMark-LLM, an LLM-driven watermarking framework that embeds watermark into source code without compromising its semantics or readability. CodeMark-LLM consists of two core components: (i) Semantically Consistent Embedding module that applies functionality-preserving transformations to encode watermark bits, and (ii) Differential Comparison Extraction module that identifies the applied transformations by comparing the original and watermarked code. Leveraging the cross-lingual generalization ability of LLM, CodeMark-LLM avoids language-specific engineering and training pipelines. Extensive experiments across diverse programming languages and attack scenarios demonstrate its robustness, effectiveness, and scalability.
☆ FUSE: Fast Semi-Supervised Node Embedding Learning via Structural and Label-Aware Optimization
Graph-based learning is a cornerstone for analyzing structured data, with node classification as a central task. However, in many real-world graphs, nodes lack informative feature vectors, leaving only neighborhood connectivity and class labels as available signals. In such cases, effective classification hinges on learning node embeddings that capture structural roles and topological context. We introduce a fast semi-supervised embedding framework that jointly optimizes three complementary objectives: (i) unsupervised structure preservation via scalable modularity approximation, (ii) supervised regularization to minimize intra-class variance among labeled nodes, and (iii) semi-supervised propagation that refines unlabeled nodes through random-walk-based label spreading with attention-weighted similarity. These components are unified into a single iterative optimization scheme, yielding high-quality node embeddings. On standard benchmarks, our method consistently achieves classification accuracy at par with or superior to state-of-the-art approaches, while requiring significantly less computational cost.
☆ Learning the Structure of Connection Graphs
Connection graphs (CGs) extend traditional graph models by coupling network topology with orthogonal transformations, enabling the representation of global geometric consistency. They play a key role in applications such as synchronization, Riemannian signal processing, and neural sheaf diffusion. In this work, we address the inverse problem of learning CGs directly from observed signals. We propose a principled framework based on maximum pseudo-likelihood under a consistency assumption, which enforces spectral properties linking the connection Laplacian to the underlying combinatorial Laplacian. Based on this formulation, we introduce the Structured Connection Graph Learning (SCGL) algorithm, a block-optimization procedure over Riemannian manifolds that jointly infers network topology, edge weights, and geometric structure. Our experiments show that SCGL consistently outperforms existing baselines in both topological recovery and geometric fidelity, while remaining computationally efficient.
☆ Analyzing Data Quality and Decay in Mega-Constellations: A Physics-Informed Machine Learning Approach
In the era of mega-constellations, the need for accurate and publicly available information has become fundamental for satellite operators to guarantee the safety of spacecrafts and the Low Earth Orbit (LEO) space environment. This study critically evaluates the accuracy and reliability of publicly available ephemeris data for a LEO mega-constellation - Starlink. The goal of this work is twofold: (i) compare and analyze the quality of the data against high-precision numerical propagation. (ii) Leverage Physics-Informed Machine Learning to extract relevant satellite quantities, such as non-conservative forces, during the decay process. By analyzing two months of real orbital data for approximately 1500 Starlink satellites, we identify discrepancies between high precision numerical algorithms and the published ephemerides, recognizing the use of simplified dynamics at fixed thresholds, planned maneuvers, and limitations in uncertainty propagations. Furthermore, we compare data obtained from multiple sources to track and analyze deorbiting satellites over the same period. Empirically, we extract the acceleration profile of satellites during deorbiting and provide insights relating to the effects of non-conservative forces during reentry. For non-deorbiting satellites, the position Root Mean Square Error (RMSE) was approximately 300 m, while for deorbiting satellites it increased to about 600 m. Through this in-depth analysis, we highlight potential limitations in publicly available data for accurate and robust Space Situational Awareness (SSA), and importantly, we propose a data-driven model of satellite decay in mega-constellations.
comment: 76th International Astronautical Congress
☆ Neural Weight Compression for Language Models
The efficient storage and transmission of language model weights is becoming increasingly important, as their scale and adoption continue to grow. However, as our understanding of this new data modality is limited, designing a good compression algorithm for language model weights heavily relies on manual, trial-and-error approaches. In this paper, we propose a learned compression framework that trains neural codecs directly from pretrained language model weights. Unlike conventional data (e.g., images), language model weights pose unique challenges: the sizes and shapes of weight tensors vary significantly, and the reconstruction quality must be judged by downstream model predictions rather than na\"ive MSE loss. To address this, we introduce Neural Weight Compression (NWC), a novel autoencoder-based neural codec tailored to model weight compression. The proposed method inherits the advantages of autoencoder-based codecs while incorporating three technical components: (1) column-wise tensor chunking and normalization; (2) an importance-aware training loss; (3) an inference-time error compensation mechanism guided by model outputs. Experiments on open-weight language models show that NWC achieves competitive or state-of-the-art accuracy-compression tradeoffs, with particularly strong results at 4-6 bit precisions where accuracy remains nearly on par with FP16 models.
LightPneumoNet: Lightweight Pneumonia Classifier
Effective pneumonia diagnosis is often challenged by the difficulty of deploying large, computationally expensive deep learning models in resource-limited settings. This study introduces LightPneumoNet, an efficient, lightweight convolutional neural network (CNN) built from scratch to provide an accessible and accurate diagnostic solution for pneumonia detection from chest X-rays. Our model was trained on a public dataset of 5,856 chest X-ray images. Preprocessing included image resizing to 224x224, grayscale conversion, and pixel normalization, with data augmentation (rotation, zoom, shear) to prevent overfitting. The custom architecture features four blocks of stacked convolutional layers and contains only 388,082 trainable parameters, resulting in a minimal 1.48 MB memory footprint. On the independent test set, our model delivered exceptional performance, achieving an overall accuracy of 0.942, precision of 0.92, and an F1-Score of 0.96. Critically, it obtained a sensitivity (recall) of 0.99, demonstrating a near-perfect ability to identify true pneumonia cases and minimize clinically significant false negatives. Notably, LightPneumoNet achieves this high recall on the same dataset where existing approaches typically require significantly heavier architectures or fail to reach comparable sensitivity levels. The model's efficiency enables deployment on low-cost hardware, making advanced computer-aided diagnosis accessible in underserved clinics and serving as a reliable second-opinion tool to improve patient outcomes.
comment: 13 pages (including references), 5 figures
☆ Enforcing convex constraints in Graph Neural Networks
Many machine learning applications require outputs that satisfy complex, dynamic constraints. This task is particularly challenging in Graph Neural Network models due to the variable output sizes of graph-structured data. In this paper, we introduce ProjNet, a Graph Neural Network framework which satisfies input-dependant constraints. ProjNet combines a sparse vector clipping method with the Component-Averaged Dykstra (CAD) algorithm, an iterative scheme for solving the best-approximation problem. We establish a convergence result for CAD and develop a GPU-accelerated implementation capable of handling large-scale inputs efficiently. To enable end-to-end training, we introduce a surrogate gradient for CAD that is both computationally efficient and better suited for optimization than the exact gradient. We validate ProjNet on four classes of constrained optimisation problems: linear programming, two classes of non-convex quadratic programs, and radio transmit power optimization, demonstrating its effectiveness across diverse problem settings.
☆ Discursive Circuits: How Do Language Models Understand Discourse Relations?
Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed as discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits ($\approx 0.2\%$ of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).
comment: Accepted to EMNLP 2025 (Main Conference); 9 pages, 8 figures, 5 tables (20 pages, 12 figures, 14 tables including references and appendices)
☆ Cross-Scale Reservoir Computing for large spatio-temporal forecasting and modeling
We propose a new reservoir computing method for forecasting high-resolution spatiotemporal datasets. By combining multi-resolution inputs from coarser to finer layers, our architecture better captures both local and global dynamics. Applied to Sea Surface Temperature data, it outperforms standard parallel reservoir models in long-term forecasting, demonstrating the effectiveness of cross-layers coupling in improving predictive accuracy. Finally, we show that the optimal network dynamics in each layer become increasingly linear, revealing the slow modes propagated to subsequent layers.
☆ Evaluating Line-level Localization Ability of Learning-based Code Vulnerability Detection Models
To address the extremely concerning problem of software vulnerability, system security is often entrusted to Machine Learning (ML) algorithms. Despite their now established detection capabilities, such models are limited by design to flagging the entire input source code function as vulnerable, rather than precisely localizing the concerned code lines. However, the detection granularity is crucial to support human operators during software development, ensuring that such predictions reflect the true code semantics to help debug, evaluate, and fix the detected vulnerabilities. To address this issue, recent work made progress toward improving the detector's localization ability, thus narrowing down the vulnerability detection "window" and providing more fine-grained predictions. Such approaches, however, implicitly disregard the presence of spurious correlations and biases in the data, which often predominantly influence the performance of ML algorithms. In this work, we investigate how detectors comply with this requirement by proposing an explainability-based evaluation procedure. Our approach, defined as Detection Alignment (DA), quantifies the agreement between the input source code lines that most influence the prediction and the actual localization of the vulnerability as per the ground truth. Through DA, which is model-agnostic and adaptable to different detection tasks, not limited to our use case, we analyze multiple learning-based vulnerability detectors and datasets. As a result, we show how the predictions of such models are consistently biased by non-vulnerable lines, ultimately highlighting the high impact of biases and spurious correlations. The code is available at https://github.com/pralab/vuln-localization-eval.
comment: Preprint
Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs
Structured sparsity enables deploying large language models (LLMs) on resource-constrained systems. Approaches like dense-to-sparse fine-tuning are particularly compelling, achieving remarkable structured sparsity by reducing the model size by over 6.7x, while still maintaining acceptable accuracy. Despite this reduction, LLM inference, especially the decode stage being inherently memory-bound, is extremely expensive on conventional Von-Neumann architectures. Compute-in-memory (CIM) architectures mitigate this by performing computations directly in memory, and when paired with sparse LLMs, enable storing and computing the entire model in memory, eliminating the data movement on the off-chip bus and improving efficiency. Nonetheless, naively mapping sparse matrices onto CIM arrays leads to poor array utilization and diminished computational efficiency. In this paper, we present an automated framework with novel mapping and scheduling strategies to accelerate sparse LLM inference on CIM accelerators. By exploiting block-diagonal sparsity, our approach improves CIM array utilization by over 50%, achieving more than 4x reduction in both memory footprint and the number of required floating-point operations.
comment: 8 pages, to appear in IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC)
☆ Protein as a Second Language for LLMs ICLR 2026
Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
comment: Main paper: 9 pages, 6 figures. With references and appendix: 18 pages, 9 figures total. Submitted to ICLR 2026 (under review)
☆ Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains?
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains underexplored. In this work, we investigate the cross-domain generalization of an LLM agent equipped with a code interpreter tool, which is exclusively trained on mathematical problem-solving tasks. Despite the restricted training domain, we evaluate the agent's performance across several distinct reasoning domains. The results reveal that RL-based tool usage learned from mathematical tasks can be effectively transferred to complex tasks in other domains, enabling great task performance and high token efficiency. To facilitate this cross-domain transfer, we propose a Tool Generalization Reinforcement Learning (TGRL) framework designed to promote domain-agnostic learning and skill migration, encompassing: (i) a standardized tool interface that abstracts domain-specific nuances through consistent formatting and explicit termination, fostering transferable invocation patterns; (ii) a dual-component reward system that decomposes rewards to incentivize generalizable behaviors like tool efficiency and reasoning abstraction, ensuring alignment and robustness across domain shifts; and (iii) an XML-based prompt template that separates thinking, tool calls, and responses to encourage modular, domain-invariant planning and coherent multi-turn interactions. Extensive experiments across diverse benchmarks validate our approach, achieving state-of-the-art performance and highlighting the cross-domain potential of Tool RL for LLM reasoning.
☆ Machine Learning-Integrated Hybrid Fluid-Kinetic Framework for Quantum Electrodynamic Laser Plasma Simulations
High-intensity laser plasma interactions create complex computational problems because they involve both fluid and kinetic regimes, which need models that maintain physical precision while keeping computational speed. The research introduces a machine learning-based three-dimensional hybrid fluid-particle-in-cell (PIC) system, which links relativistic plasma behavior to automatic regime transitions. The technique employs fluid approximations for stable areas but activates the PIC solver when SwitchNet directs it to unstable sections through its training on physics-based synthetic data. The model uses a smooth transition between Ammosov-Delone-Krainov (ADK) tunneling and multiphoton ionization rates to simulate ionization, while Airy-function approximations simulate quantum electrodynamic (QED) effects for radiation reaction and pair production. The convolutional neural network applies energy conservation through physics-based loss functions, which operate on normalized fields per channel. Monte Carlo dropout provides uncertainty measurement. The hybrid model produces precise predictions with coefficient of determination (R^2) values above 0.95 and mean squared errors below 10^-4 for all field components. This adaptive approach enhances the accuracy and scalability of laser-plasma simulations, providing a unified predictive framework for high-energy-density and particle acceleration applications.
☆ EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling
With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, however, allocates the same compute budget for each prompt. Grounded on the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to the Full Parallel Sampling.
☆ PAC-Bayesian Bounds on Constrained f-Entropic Risk Measures
PAC generalization bounds on the risk, when expressed in terms of the expected loss, are often insufficient to capture imbalances between subgroups in the data. To overcome this limitation, we introduce a new family of risk measures, called constrained f-entropic risk measures, which enable finer control over distributional shifts and subgroup imbalances via f-divergences, and include the Conditional Value at Risk (CVaR), a well-known risk measure. We derive both classical and disintegrated PAC-Bayesian generalization bounds for this family of risks, providing the first disintegratedPAC-Bayesian guarantees beyond standard risks. Building on this theory, we design a self-bounding algorithm that minimizes our bounds directly, yielding models with guarantees at the subgroup level. Finally, we empirically demonstrate the usefulness of our approach.
☆ ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces ICML 2025
Large output spaces, also referred to as Extreme multilabel classification (XMC), is a setting that arises, e.g., in large-scale tagging and product-to-product recommendation, and is characterized by the number of labels ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable, and inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations -- gradient fusion and chunking -- enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method, Renee without compromising accuracy.
comment: Accepted to ICML 2025
☆ Beyond single-model XAI: aggregating multi-model explanations for enhanced trustworthiness
The use of Artificial Intelligence (AI) models in real-world and high-risk applications has intensified the discussion about their trustworthiness and ethical usage, from both a technical and a legislative perspective. The field of eXplainable Artificial Intelligence (XAI) addresses this challenge by proposing explanations that bring to light the decision-making processes of complex black-box models. Despite being an essential property, the robustness of explanations is often an overlooked aspect during development: only robust explanation methods can increase the trust in the system as a whole. This paper investigates the role of robustness through the usage of a feature importance aggregation derived from multiple models ($k$-nearest neighbours, random forest and neural networks). Preliminary results showcase the potential in increasing the trustworthiness of the application, while leveraging multiple model's predictive power.
comment: Accepted at the European Workshop on Trustworthy Artificial Intelligence (TRUST-AI), co-located within ECAI 2025
☆ Emergence of hybrid computational dynamics through reinforcement learning
Understanding how learning algorithms shape the computational strategies that emerge in neural networks remains a fundamental challenge in machine intelligence. While network architectures receive extensive attention, the role of the learning paradigm itself in determining emergent dynamics remains largely unexplored. Here we demonstrate that reinforcement learning (RL) and supervised learning (SL) drive recurrent neural networks (RNNs) toward fundamentally different computational solutions when trained on identical decision-making tasks. Through systematic dynamical systems analysis, we reveal that RL spontaneously discovers hybrid attractor architectures, combining stable fixed-point attractors for decision maintenance with quasi-periodic attractors for flexible evidence integration. This contrasts sharply with SL, which converges almost exclusively to simpler fixed-point-only solutions. We further show that RL sculpts functionally balanced neural populations through a powerful form of implicit regularization -- a structural signature that enhances robustness and is conspicuously absent in the more heterogeneous solutions found by SL-trained networks. The prevalence of these complex dynamics in RL is controllably modulated by weight initialization and correlates strongly with performance gains, particularly as task complexity increases. Our results establish the learning algorithm as a primary determinant of emergent computation, revealing how reward-based optimization autonomously discovers sophisticated dynamical mechanisms that are less accessible to direct gradient-based optimization. These findings provide both mechanistic insights into neural computation and actionable principles for designing adaptive AI systems.
comment: 22 pages, 11 figures
☆ Enhanced Sampling for Efficient Learning of Coarse-Grained Machine Learning Potentials
Coarse-graining (CG) enables molecular dynamics (MD) simulations of larger systems and longer timescales that are otherwise infeasible with atomistic models. Machine learning potentials (MLPs), with their capacity to capture many-body interactions, can provide accurate approximations of the potential of mean force (PMF) in CG models. Current CG MLPs are typically trained in a bottom-up manner via force matching, which in practice relies on configurations sampled from the unbiased equilibrium Boltzmann distribution to ensure thermodynamic consistency. This convention poses two key limitations: first, sufficiently long atomistic trajectories are needed to reach convergence; and second, even once equilibrated, transition regions remain poorly sampled. To address these issues, we employ enhanced sampling to bias along CG degrees of freedom for data generation, and then recompute the forces with respect to the unbiased potential. This strategy simultaneously shortens the simulation time required to produce equilibrated data and enriches sampling in transition regions, while preserving the correct PMF. We demonstrate its effectiveness on the M\"uller-Brown potential and capped alanine, achieving notable improvements. Our findings support the use of enhanced sampling for force matching as a promising direction to improve the accuracy and reliability of CG MLPs.
☆ torchsom: The Reference PyTorch Library for Self-Organizing Maps
This paper introduces torchsom, an open-source Python library that provides a reference implementation of the Self-Organizing Map (SOM) in PyTorch. This package offers three main features: (i) dimensionality reduction, (ii) clustering, and (iii) friendly data visualization. It relies on a PyTorch backend, enabling (i) fast and efficient training of SOMs through GPU acceleration, and (ii) easy and scalable integrations with PyTorch ecosystem. Moreover, torchsom follows the scikit-learn API for ease of use and extensibility. The library is released under the Apache 2.0 license with 90% test coverage, and its source code and documentation are available at https://github.com/michelin/TorchSOM.
comment: 4 mains pages with 2 tables, 4 pages of references, 15 pages of appendices with 13 figures and 3 tables
☆ A Comprehensive Forecasting-Based Framework for Time Series Anomaly Detection: Benchmarking on the Numenta Anomaly Benchmark (NAB)
Time series anomaly detection is critical for modern digital infrastructures, yet existing methods lack systematic cross-domain evaluation. We present a comprehensive forecasting-based framework unifying classical methods (Holt-Winters, SARIMA) with deep learning architectures (LSTM, Informer) under a common residual-based detection interface. Our modular pipeline integrates preprocessing (normalization, STL decomposition), four forecasting models, four detection methods, and dual evaluation through forecasting metrics (MAE, RMSE, PCC) and detection metrics (Precision, Recall, F1, AUC). We conduct the first complete evaluation on the Numenta Anomaly Benchmark (58 datasets, 7 categories) with 232 model training runs and 464 detection evaluations achieving 100\% success rate. LSTM achieves best performance (F1: 0.688, ranking first or second on 81\% of datasets) with exceptional correlation on complex patterns (PCC: 0.999). Informer provides competitive accuracy (F1: 0.683) with 30\% faster training. Classical methods achieve perfect predictions on simple synthetic data with 60 lower cost but show 2-3 worse F1-scores on real-world datasets. Forecasting quality dominates detection performance: differences between detection methods (F1: 0.621-0.688) are smaller than between forecasting models (F1: 0.344-0.688). Our findings provide evidence-based guidance: use LSTM for complex patterns, Informer for efficiency-critical deployments, and classical methods for simple periodic data with resource constraints. The complete implementation and results establish baselines for future forecasting-based anomaly detection research.
☆ DUAL: Learning Diverse Kernels for Aggregated Two-sample and Independence Testing
To adapt kernel two-sample and independence testing to complex structured data, aggregation of multiple kernels is frequently employed to boost testing power compared to single-kernel tests. However, we observe a phenomenon that directly maximizing multiple kernel-based statistics may result in highly similar kernels that capture highly overlapping information, limiting the effectiveness of aggregation. To address this, we propose an aggregated statistic that explicitly incorporates kernel diversity based on the covariance between different kernels. Moreover, we identify a fundamental challenge: a trade-off between the diversity among kernels and the test power of individual kernels, i.e., the selected kernels should be both effective and diverse. This motivates a testing framework with selection inference, which leverages information from the training phase to select kernels with strong individual performance from the learned diverse kernel pool. We provide rigorous theoretical statements and proofs to show the consistency on the test power and control of Type-I error, along with asymptotic analysis of the proposed statistics. Lastly, we conducted extensive empirical experiments demonstrating the superior performance of our proposed approach across various benchmarks for both two-sample and independence testing.
☆ Test-Time Adaptation by Causal Trimming NeurIPS 2025
Test-time adaptation aims to improve model robustness under distribution shifts by adapting models with access to unlabeled target samples. A primary cause of performance degradation under such shifts is the model's reliance on features that lack a direct causal relationship with the prediction target. We introduce Test-time Adaptation by Causal Trimming (TACT), a method that identifies and removes non-causal components from representations for test distributions. TACT applies data augmentations that preserve causal features while varying non-causal ones. By analyzing the changes in the representations using Principal Component Analysis, TACT identifies the highest variance directions associated with non-causal features. It trims the representations by removing their projections on the identified directions, and uses the trimmed representations for the predictions. During adaptation, TACT continuously tracks and refines these directions to get a better estimate of non-causal features. We theoretically analyze the effectiveness of this approach and empirically validate TACT on real-world out-of-distribution benchmarks. TACT consistently outperforms state-of-the-art methods by a significant margin.
comment: Accepted to the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025); Code is available at https://github.com/NancyQuris/TACT
Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer
Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student's learned representations by feeding them back into the frozen teacher's prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.
☆ Refining Hybrid Genetic Search for CVRP via Reinforcement Learning-Finetuned LLM
While large language models (LLMs) are increasingly used as automated heuristic designers for vehicle routing problems (VRPs), current state-of-the-art methods predominantly rely on prompting massive, general-purpose models like GPT-4. This work challenges that paradigm by demonstrating that a smaller, specialized LLM, when meticulously fine-tuned, can generate components that surpass expert-crafted heuristics within advanced solvers. We propose RFTHGS, a novel Reinforcement learning (RL) framework for Fine-Tuning a small LLM to generate high-performance crossover operators for the Hybrid Genetic Search (HGS) solver, applied to the Capacitated VRP (CVRP). Our method employs a multi-tiered, curriculum-based reward function that progressively guides the LLM to master generating first compilable, then executable, and finally, superior-performing operators that exceed human expert designs. This is coupled with an operator caching mechanism that discourages plagiarism and promotes diversity during training. Comprehensive experiments show that our fine-tuned LLM produces crossover operators which significantly outperform the expert-designed ones in HGS. The performance advantage remains consistent, generalizing from small-scale instances to large-scale problems with up to 1000 nodes. Furthermore, RFTHGS exceeds the performance of leading neuro-combinatorial baselines, prompt-based methods, and commercial LLMs such as GPT-4o and GPT-4o-mini.
☆ PhysioME: A Robust Multimodal Self-Supervised Framework for Physiological Signals with Missing Modalities
Missing or corrupted modalities are common in physiological signal-based medical applications owing to hardware constraints or motion artifacts. However, most existing methods assume the availability of all modalities, resulting in substantial performance degradation in the absence of any modality. To overcome this limitation, this study proposes PhysioME, a robust framework designed to ensure reliable performance under missing modality conditions. PhysioME adopts: (1) a multimodal self-supervised learning approach that combines contrastive learning with masked prediction; (2) a Dual-PathNeuroNet backbone tailored to capture the temporal dynamics of each physiological signal modality; and (3) a restoration decoder that reconstructs missing modality tokens, enabling flexible processing of incomplete inputs. The experimental results show that PhysioME achieves high consistency and generalization performance across various missing modality scenarios. These findings highlight the potential of PhysioME as a reliable tool for supporting clinical decision-making in real-world settings with imperfect data availability.
comment: 9 pages, 2 figures
☆ Graph Neural Network-Based Multicast Routing for On-Demand Streaming Services in 6G Networks
The increase of bandwidth-intensive applications in sixth-generation (6G) wireless networks, such as real-time volumetric streaming and multi-sensory extended reality, demands intelligent multicast routing solutions capable of delivering differentiated quality-of-service (QoS) at scale. Traditional shortest-path and multicast routing algorithms are either computationally prohibitive or structurally rigid, and they often fail to support heterogeneous user demands, leading to suboptimal resource utilization. Neural network-based approaches, while offering improved inference speed, typically lack topological generalization and scalability. To address these limitations, this paper presents a graph neural network (GNN)-based multicast routing framework that jointly minimizes total transmission cost and supports user-specific video quality requirements. The routing problem is formulated as a constrained minimum-flow optimization task, and a reinforcement learning algorithm is developed to sequentially construct efficient multicast trees by reusing paths and adapting to network dynamics. A graph attention network (GAT) is employed as the encoder to extract context-aware node embeddings, while a long short-term memory (LSTM) module models the sequential dependencies in routing decisions. Extensive simulations demonstrate that the proposed method closely approximates optimal dynamic programming-based solutions while significantly reducing computational complexity. The results also confirm strong generalization to large-scale and dynamic network topologies, highlighting the method's potential for real-time deployment in 6G multimedia delivery scenarios. Code is available at https://github.com/UNIC-Lab/GNN-Routing.
☆ Causal Disentanglement Learning for Accurate Anomaly Detection in Multivariate Time Series
Disentangling complex causal relationships is important for accurate detection of anomalies. In multivariate time series analysis, dynamic interactions among data variables over time complicate the interpretation of causal relationships. Traditional approaches assume statistical independence between variables in unsupervised settings, whereas recent methods capture feature correlations through graph representation learning. However, their representations fail to explicitly infer the causal relationships over different time periods. To solve the problem, we propose Causally Disentangled Representation Learning for Anomaly Detection (CDRL4AD) to detect anomalies and identify their causal relationships in multivariate time series. First, we design the causal process as model input, the temporal heterogeneous graph, and causal relationships. Second, our representation identifies causal relationships over different time periods and disentangles latent variables to infer the corresponding causal factors. Third, our experiments on real-world datasets demonstrate that CDRL4AD outperforms state-of-the-art methods in terms of accuracy and root cause analysis. Fourth, our model analysis validates hyperparameter sensitivity and the time complexity of CDRL4AD. Last, we conduct a case study to show how our approach assists human experts in diagnosing the root causes of anomalies.
comment: 20 pages, 4 Figures,
☆ PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System
Deploying humanoid robots to interact with real-world environments--such as carrying objects or sitting on chairs--requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system is still an ongoing challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks--box carrying, sitting, lying, and standing up--in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.
comment: Project website: https://why618188.github.io/physhsi/
Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction
Edge devices face significant challenges due to limited computational resources and distribution shifts, making efficient and adaptable machine learning essential. Existing test-time adaptation (TTA) methods often rely on gradient-based optimization or batch processing, which are inherently unsuitable for resource-constrained edge scenarios due to their reliance on backpropagation and high computational demands. Gradient-free alternatives address these issues but often suffer from limited learning capacity, lack flexibility, or impose architectural constraints. To overcome these limitations, we propose a novel single-instance TTA method tailored for edge devices (TED), which employs forward-only coordinate optimization in the principal subspace of latent using the covariance matrix adaptation evolution strategy (CMA-ES). By updating a compact low-dimensional vector, TED not only enhances output confidence but also aligns the latent representation closer to the source latent distribution within the latent principal subspace. This is achieved without backpropagation, keeping the model parameters frozen, and enabling efficient, forgetting-free adaptation with minimal memory and computational overhead. Experiments on image classification and keyword spotting tasks across the ImageNet and Google Speech Commands series datasets demonstrate that TED achieves state-of-the-art performance while $\textit{reducing computational complexity by up to 63 times}$, offering a practical and scalable solution for real-world edge applications. Furthermore, we successfully $\textit{deployed TED on the ZYNQ-7020 platform}$, demonstrating its feasibility and effectiveness for resource-constrained edge devices in real-world deployments.
comment: Under review
☆ Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs
Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
☆ Robust Photoplethysmography Signal Denoising via Mamba Networks
Photoplethysmography (PPG) is widely used in wearable health monitoring, but its reliability is often degraded by noise and motion artifacts, limiting downstream applications such as heart rate (HR) estimation. This paper presents a deep learning framework for PPG denoising with an emphasis on preserving physiological information. In this framework, we propose DPNet, a Mamba-based denoising backbone designed for effective temporal modeling. To further enhance denoising performance, the framework also incorporates a scale-invariant signal-to-distortion ratio (SI-SDR) loss to promote waveform fidelity and an auxiliary HR predictor (HRP) that provides physiological consistency through HR-based supervision. Experiments on the BIDMC dataset show that our method achieves strong robustness against both synthetic noise and real-world motion artifacts, outperforming conventional filtering and existing neural models. Our method can effectively restore PPG signals while maintaining HR accuracy, highlighting the complementary roles of SI-SDR loss and HR-guided supervision. These results demonstrate the potential of our approach for practical deployment in wearable healthcare systems.
comment: 5 pages, 2 figures
☆ Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models
Diffusion models have achieved remarkable success as generative models. However, even a well-trained model can accumulate errors throughout the generation process. These errors become particularly problematic when arbitrary guidance is applied to steer samples toward desired properties, which often breaks sample fidelity. In this paper, we propose a general solution to address the off-manifold phenomenon observed in diffusion models. Our approach leverages a time predictor to estimate deviations from the desired data manifold at each timestep, identifying that a larger time gap is associated with reduced generation quality. We then design a novel guidance mechanism, `Temporal Alignment Guidance' (TAG), attracting the samples back to the desired manifold at every timestep during generation. Through extensive experiments, we demonstrate that TAG consistently produces samples closely aligned with the desired manifold at each timestep, leading to significant improvements in generation quality across various downstream tasks.
comment: 54 pages, 17 figures, 18 tables
☆ Conformal Inference for Time Series over Graphs
Trustworthy decision making in networked, dynamic environments calls for innovative uncertainty quantification substrates in predictive models for graph time series. Existing conformal prediction (CP) methods have been applied separately to multivariate time series and static graphs, but they either ignore the underlying graph topology or neglect temporal dynamics. To bridge this gap, here we develop a CP-based sequential prediction region framework tailored for graph time series. A key technical innovation is to leverage the graph structure and thus capture pairwise dependencies across nodes, while providing user-specified coverage guarantees on the predictive outcomes. We formally establish that our scheme yields an exponential shrinkage in the volume of the ellipsoidal prediction set relative to its graph-agnostic counterpart. Using real-world datasets, we demonstrate that the novel uncertainty quantification framework maintains desired empirical coverage while achieving markedly smaller (up to 80% reduction) prediction regions than existing approaches.
☆ The Easy Path to Robustness: Coreset Selection using Sample Hardness
Designing adversarially robust models from a data-centric perspective requires understanding which input samples are most crucial for learning resilient features. While coreset selection provides a mechanism for efficient training on data subsets, current algorithms are designed for clean accuracy and fall short in preserving robustness. To address this, we propose a framework linking a sample's adversarial vulnerability to its \textit{hardness}, which we quantify using the average input gradient norm (AIGN) over training. We demonstrate that \textit{easy} samples (with low AIGN) are less vulnerable and occupy regions further from the decision boundary. Leveraging this insight, we present EasyCore, a coreset selection algorithm that retains only the samples with low AIGN for training. We empirically show that models trained on EasyCore-selected data achieve significantly higher adversarial accuracy than those trained with competing coreset methods under both standard and adversarial training. As AIGN is a model-agnostic dataset property, EasyCore is an efficient and widely applicable data-centric method for improving adversarial robustness. We show that EasyCore achieves up to 7\% and 5\% improvement in adversarial accuracy under standard training and TRADES adversarial training, respectively, compared to existing coreset methods.
☆ Instruction-aware User Embedding via Synergistic Language and Representation Modeling
User representation modeling has become increasingly crucial for personalized applications, yet existing approaches struggle with generalizability across domains and sensitivity to noisy behavioral signals. We present InstructUE, an instruction-aware user embedding foundation model that leverages large language models (LLMs) to generate general and instruction-aware user representations. InstructUE introduces a multi-encoder architecture with a lightweight adapter that efficiently processes heterogeneous data from six different sources while preserving their structural characteristics. Additionally, it proposes a novel contrastive-autoregressive training framework that bridges language and representation spaces through a curated UserQA dataset. The contrastive-autoregressive training framework simultaneously leverages autoregressive learning to capture domain knowledge in language space and contrastive learning to align user-text embeddings in representation space, thereby enhancing the instruction-awareness and noise-robustness of user embeddings. Through extensive experiments on real-world applications, we demonstrate that InstructUE significantly outperforms existing methods across multiple domains including user prediction, marketing, and recommendation scenarios. Our results show that instruction-aware user modeling can effectively achieve instruction-guided denoising of user information in specific scenarios, paving the way for more generalizable and robust user representation learning.
☆ GrASP: A Generalizable Address-based Semantic Prefetcher for Scalable Transactional and Analytical Workloads
Data prefetching--loading data into the cache before it is requested--is essential for reducing I/O overhead and improving database performance. While traditional prefetchers focus on sequential patterns, recent learning-based approaches, especially those leveraging data semantics, achieve higher accuracy for complex access patterns. However, these methods often struggle with today's dynamic, ever-growing datasets and require frequent, timely fine-tuning. Privacy constraints may also restrict access to complete datasets, necessitating prefetchers that can learn effectively from samples. To address these challenges, we present GrASP, a learning-based prefetcher designed for both analytical and transactional workloads. GrASP enhances prefetching accuracy and scalability by leveraging logical block address deltas and combining query representations with result encodings. It frames prefetching as a context-aware multi-label classification task, using multi-layer LSTMs to predict delta patterns from embedded context. This delta modeling approach enables GrASP to generalize predictions from small samples to larger, dynamic datasets without requiring extensive retraining. Experiments on real-world datasets and industrial benchmarks demonstrate that GrASP generalizes to datasets 250 times larger than the training data, achieving up to 45% higher hit ratios, 60% lower I/O time, and 55% lower end-to-end query execution latency than existing baselines. On average, GrASP attains a 91.4% hit ratio, a 90.8% I/O time reduction, and a 57.1% execution latency reduction.
comment: This is a preprint version
☆ ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios
Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization--such as gender and caste--shape experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates--harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.
comment: 28 pages, 11 figures, 16 tables. In submission
☆ Adversarial Robustness in One-Stage Learning-to-Defer
Learning-to-Defer (L2D) enables hybrid decision-making by routing inputs either to a predictor or to external experts. While promising, L2D is highly vulnerable to adversarial perturbations, which can not only flip predictions but also manipulate deferral decisions. Prior robustness analyses focus solely on two-stage settings, leaving open the end-to-end (one-stage) case where predictor and allocation are trained jointly. We introduce the first framework for adversarial robustness in one-stage L2D, covering both classification and regression. Our approach formalizes attacks, proposes cost-sensitive adversarial surrogate losses, and establishes theoretical guarantees including $\mathcal{H}$, $(\mathcal{R }, \mathcal{F})$, and Bayes consistency. Experiments on benchmark datasets confirm that our methods improve robustness against untargeted and targeted attacks while preserving clean performance.
☆ Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization
Recent AI regulations call for data that remain useful for innovation while resistant to misuse, balancing utility with protection at the model level. Existing approaches either perturb data to make it unlearnable or retrain models to suppress transfer, but neither governs inference by unknown models, and both typically require control over training. We propose non-transferable examples (NEs), a training-free and data-agnostic input-side usage-control mechanism. We recode inputs within a model-specific low-sensitivity subspace, preserving outputs for the authorized model while reducing performance on unauthorized models through subspace misalignment. We establish formal bounds that guarantee utility for the authorized model and quantify deviation for unauthorized ones, with the Hoffman-Wielandt inequality linking degradation to spectral differences. Empirically, NEs retain performance on diverse vision backbones and state-of-the-art vision-language models under common preprocessing, whereas non-target models collapse even with reconstruction attempts. These results establish NEs as a practical means to preserve intended data utility while preventing unauthorized exploitation. Our project is available at https://trusted-system-lab.github.io/model-specificity
☆ In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.
☆ On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation
Self-supervised learning (SSL) has achieved remarkable success by learning meaningful representations without labeled data. However, a unified theoretical framework for understanding and comparing the efficiency of different SSL paradigms remains elusive. In this paper, we introduce a novel information-geometric framework to quantify representation efficiency. We define representation efficiency $\eta$ as the ratio between the effective intrinsic dimension of the learned representation space and its ambient dimension, where the effective dimension is derived from the spectral properties of the Fisher Information Matrix (FIM) on the statistical manifold induced by the encoder. Within this framework, we present a theoretical analysis of the Barlow Twins method. Under specific but natural assumptions, we prove that Barlow Twins achieves optimal representation efficiency ($\eta = 1$) by driving the cross-correlation matrix of representations towards the identity matrix, which in turn induces an isotropic FIM. This work provides a rigorous theoretical foundation for understanding the effectiveness of Barlow Twins and offers a new geometric perspective for analyzing SSL algorithms.
comment: 7 pages
☆ Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning
The capabilities of Large Vision-Language Models (LVLMs) have reached state-of-the-art on many visual reasoning tasks, including chart reasoning, yet they still falter on out-of-distribution (OOD) data, and degrade further when asked to produce their chain-of-thought (CoT) rationales, limiting explainability. We present Chart-RVR, a general framework that fine-tunes LVLMs to be more robust and explainable for chart reasoning by coupling Group Relative Policy Optimization (GRPO) with automatically verifiable rewards. Our framework comprises of three rewards that maximize: (i) correct chart-type classification, (ii) faithful chart table reconstruction, and (iii) process conformity. Applied to 3-billion-parameter LVLMs, Chart-RVR consistently outperforms standard supervised fine-tuning (SFT) on both in-distribution and out-of-distribution datasets, closing the OOD performance gap while improving rationale fidelity. The resulting models, the Chart-RVR-3B series, achieve state-of-the-art results on six chart-reasoning benchmarks spanning in-domain and OOD settings, surpassing all existing models of comparable size. Beyond accuracy, Chart-RVR yields more interpretable CoT rationales, strengthening trust and reliability - showcasing the power of verifiable rewards with GRPO for training reliable, interpretable chart-reasoning models.
comment: 23 pages
☆ Blade: A Derivative-free Bayesian Inversion Method using Diffusion Priors
Derivative-free Bayesian inversion is an important task in many science and engineering applications, particularly when computing the forward model derivative is computationally and practically challenging. In this paper, we introduce Blade, which can produce accurate and well-calibrated posteriors for Bayesian inversion using an ensemble of interacting particles. Blade leverages powerful data-driven priors based on diffusion models, and can handle nonlinear forward models that permit only black-box access (i.e., derivative-free). Theoretically, we establish a non-asymptotic convergence analysis to characterize the effects of forward model and prior estimation errors. Empirically, Blade achieves superior performance compared to existing derivative-free Bayesian inversion methods on various inverse problems, including challenging highly nonlinear fluid dynamics.
☆ Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models
While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.
comment: 20 pages, 12 figures
☆ APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport
The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning, where the Bradley-Terry (BT) objective has been recognized as simple yet powerful, specifically for pairwise preference learning. However, BT-based RMs often struggle to effectively distinguish between similar preference responses, leading to insufficient separation between preferred and non-preferred outputs. Consequently, they may easily overfit easy samples and cannot generalize well to Out-Of-Distribution (OOD) samples, resulting in suboptimal performance. To address these challenges, this paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism. Specifically, we design to dynamically adjust the RM focus on more challenging samples through margins, based on both semantic similarity and model-predicted reward differences, which is approached from a distributional perspective solvable with Optimal Transport (OT). By incorporating these factors into a principled OT cost matrix design, our adaptive margin enables the RM to better capture distributional differences between chosen and rejected responses, yielding significant improvements in performance, convergence speed, and generalization capabilities. Experimental results across multiple benchmarks demonstrate that our method outperforms several existing RM techniques, showcasing enhanced performance in both In-Distribution (ID) and OOD settings. Moreover, RLHF experiments support our practical effectiveness in better aligning LLMs with human preferences. Our code is available at https://github.com/BIRlz/APLOT
comment: EMNLP2025
☆ MC#: Mixture Compressor for Mixture-of-Experts Large Models
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. However, preloading all experts into memory and activating multiple experts per input introduces significant computational and memory overhead, making the expert module a major contributor to model size and inference cost. To address this, we propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning by leveraging the significance of experts and tokens for aggressive compression of MoE-LLMs/VLMs. To reduce storage and loading costs, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which optimizes bit allocation via linear programming, balancing expert importance and quantization error for a Pareto-optimal trade-off between size and performance. To reduce runtime computation, Online Top-any Pruning (OTP) uses Gumbel-Softmax sampling to dynamically select a subset of experts per token, enabling fine-grained control over activation. By combining PMQ's static bit-width optimization with OTP's dynamic routing, MC# achieves extreme compression with minimal accuracy loss. On DeepSeek-VL2, MC# achieves a 6.2 times weight reduction at 2.57 average bits with only a 1.7% accuracy drop across five multimodal benchmarks. Additionally, OTP reduces expert activation over 20% with less than 1% performance degradation, demonstrating strong potential for efficient MoE-based model deployment.
comment: 15 pages, 13 figures
☆ Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)--a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
comment: 16 pages, 4 figures
☆ Comparative Evaluation of Neural Network Architectures for Generalizable Human Spatial Preference Prediction in Unseen Built Environments
The capacity to predict human spatial preferences within built environments is instrumental for developing Cyber-Physical-Social Infrastructure Systems (CPSIS). A significant challenge in this domain is the generalizability of preference models, particularly their efficacy in predicting preferences within environmental configurations not encountered during training. While deep learning models have shown promise in learning complex spatial and contextual dependencies, it remains unclear which neural network architectures are most effective at generalizing to unseen layouts. To address this, we conduct a comparative study of Graph Neural Networks, Convolutional Neural Networks, and standard feedforward Neural Networks using synthetic data generated from a simplified and synthetic pocket park environment. Beginning with this illustrative case study, allows for controlled analysis of each model's ability to transfer learned preference patterns to unseen spatial scenarios. The models are evaluated based on their capacity to predict preferences influenced by heterogeneous physical, environmental, and social features. Generalizability score is calculated using the area under the precision-recall curve for the seen and unseen layouts. This generalizability score is appropriate for imbalanced data, providing insights into the suitability of each neural network architecture for preference-aware human behavior modeling in unseen built environments.
comment: The 15th International Workshop on Structural Health Monitoring (IWSHM)
☆ Interpretable Machine Learning for Cognitive Aging: Handling Missing Data and Uncovering Social Determinant
Early detection of Alzheimer's disease (AD) is crucial because its neurodegenerative effects are irreversible, and neuropathologic and social-behavioral risk factors accumulate years before diagnosis. Identifying higher-risk individuals earlier enables prevention, timely care, and equitable resource allocation. We predict cognitive performance from social determinants of health (SDOH) using the NIH NIA-supported PREPARE Challenge Phase 2 dataset derived from the nationally representative Mex-Cog cohort of the 2003 and 2012 Mexican Health and Aging Study (MHAS). Data: The target is a validated composite cognitive score across seven domains-orientation, memory, attention, language, constructional praxis, and executive function-derived from the 2016 and 2021 MHAS waves. Predictors span demographic, socioeconomic, health, lifestyle, psychosocial, and healthcare access factors. Methodology: Missingness was addressed with a singular value decomposition (SVD)-based imputation pipeline treating continuous and categorical variables separately. This approach leverages latent feature correlations to recover missing values while balancing reliability and scalability. After evaluating multiple methods, XGBoost was chosen for its superior predictive performance. Results and Discussion: The framework outperformed existing methods and the data challenge leaderboard, demonstrating high accuracy, robustness, and interpretability. SHAP-based post hoc analysis identified top contributing SDOH factors and age-specific feature patterns. Notably, flooring material emerged as a strong predictor, reflecting socioeconomic and environmental disparities. Other influential factors, age, SES, lifestyle, social interaction, sleep, stress, and BMI, underscore the multifactorial nature of cognitive aging and the value of interpretable, data-driven SDOH modeling.
☆ Redundancy as a Structural Information Principle for Learning and Generalization
We present a theoretical framework that extends classical information theory to finite and structured systems by redefining redundancy as a fundamental property of information organization rather than inefficiency. In this framework, redundancy is expressed as a general family of informational divergences that unifies multiple classical measures, such as mutual information, chi-squared dependence, and spectral redundancy, under a single geometric principle. This reveals that these traditional quantities are not isolated heuristics but projections of a shared redundancy geometry. The theory further predicts that redundancy is bounded both above and below, giving rise to an optimal equilibrium that balances over-compression (loss of structure) and over-coupling (collapse). While classical communication theory favors minimal redundancy for transmission efficiency, finite and structured systems, such as those underlying real-world learning, achieve maximal stability and generalization near this equilibrium. Experiments with masked autoencoders are used to illustrate and verify this principle: the model exhibits a stable redundancy level where generalization peaks. Together, these results establish redundancy as a measurable and tunable quantity that bridges the asymptotic world of communication and the finite world of learning.
☆ Neutral Agent-based Adversarial Policy Learning against Deep Reinforcement Learning in Multi-party Open Systems
Reinforcement learning (RL) has been an important machine learning paradigm for solving long-horizon sequential decision-making problems under uncertainty. By integrating deep neural networks (DNNs) into the RL framework, deep reinforcement learning (DRL) has emerged, which achieved significant success in various domains. However, the integration of DNNs also makes it vulnerable to adversarial attacks. Existing adversarial attack techniques mainly focus on either directly manipulating the environment with which a victim agent interacts or deploying an adversarial agent that interacts with the victim agent to induce abnormal behaviors. While these techniques achieve promising results, their adoption in multi-party open systems remains limited due to two major reasons: impractical assumption of full control over the environment and dependent on interactions with victim agents. To enable adversarial attacks in multi-party open systems, in this paper, we redesigned an adversarial policy learning approach that can mislead well-trained victim agents without requiring direct interactions with these agents or full control over their environments. Particularly, we propose a neutral agent-based approach across various task scenarios in multi-party open systems. While the neutral agents seemingly are detached from the victim agents, indirectly influence them through the shared environment. We evaluate our proposed method on the SMAC platform based on Starcraft II and the autonomous driving simulation platform Highway-env. The experimental results demonstrate that our method can launch general and effective adversarial attacks in multi-party open systems.
☆ End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: A Reproducibility Study
We present a reproducibility study of the state-of-the-art neural architecture for sequence labeling proposed by Ma and Hovy (2016)\cite{ma2016end}. The original BiLSTM-CNN-CRF model combines character-level representations via Convolutional Neural Networks (CNNs), word-level context modeling through Bi-directional Long Short-Term Memory networks (BiLSTMs), and structured prediction using Conditional Random Fields (CRFs). This end-to-end approach eliminates the need for hand-crafted features while achieving excellent performance on named entity recognition (NER) and part-of-speech (POS) tagging tasks. Our implementation successfully reproduces the key results, achieving 91.18\% F1-score on CoNLL-2003 NER and demonstrating the model's effectiveness across sequence labeling tasks. We provide a detailed analysis of the architecture components and release an open-source PyTorch implementation to facilitate further research.
☆ Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
Training student models on synthetic data generated by strong teacher models is a promising way to distilling the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new ``Route then Generate'' paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional ``Generate then Select'' paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.
comment: 19 pages, 10 figures
☆ FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
☆ DreamMakeup: Face Makeup Customization using Latent Diffusion Models
The exponential growth of the global makeup market has paralleled advancements in virtual makeup simulation technology. Despite the progress led by GANs, their application still encounters significant challenges, including training instability and limited customization capabilities. Addressing these challenges, we introduce DreamMakup - a novel training-free Diffusion model based Makeup Customization method, leveraging the inherent advantages of diffusion models for superior controllability and precise real-image editing. DreamMakeup employs early-stopped DDIM inversion to preserve the facial structure and identity while enabling extensive customization through various conditioning inputs such as reference images, specific RGB colors, and textual descriptions. Our model demonstrates notable improvements over existing GAN-based and recent diffusion-based frameworks - improved customization, color-matching capabilities, identity preservation and compatibility with textual descriptions or LLMs with affordable computational costs.
☆ LPCVAE: A Conditional VAE with Long-Term Dependency and Probabilistic Time-Frequency Fusion for Time Series Anomaly Detection
Time series anomaly detection(TSAD) is a critical task in signal processing field, ensuring the reliability of complex systems. Reconstruction-based methods dominate in TSAD. Among these methods, VAE-based methods have achieved promising results. Existing VAE-based methods suffer from the limitation of single-window feature and insufficient leveraging of long-term time and frequency information. We propose a Conditional Variational AutoEncoder with Long-term dependency and Probabilistic time-frequency fusion, named LPCVAE. LPCVAE introduces LSTM to capture long-term dependencies beyond windows. It further incorporates a Product-of-Experts (PoE) mechanism for adaptive and distribution-level probabilistic fusion. This design effectively mitigates time-frequency information loss. Extensive experiments on four public datasets demonstrate it outperforms state-of-the-art methods. The results confirm that integrating long-term time and frequency representations with adaptive fusion yields a robust and efficient solution for TSAD.
Quantifying Information Disclosure During Gradient Descent Using Gradient Uniqueness
Disclosing private information via publication of a machine learning model is often a concern. Intuitively, publishing a learned model should be less risky than publishing a dataset. But how much risk is there? In this paper, we present a principled disclosure metric called \emph{gradient uniqueness} that is derived from an upper bound on the amount of information disclosure from publishing a learned model. Gradient uniqueness provides an intuitive way to perform privacy auditing. The mathematical derivation of gradient uniqueness is general, and does not make any assumption on the model architecture, dataset type, or the strategy of an attacker. We examine a simple defense based on monitoring gradient uniqueness, and find that it achieves privacy comparable to classical methods such as DP-SGD, while being substantially better in terms of (utility) testing accuracy.
☆ Topological Alignment of Shared Vision-Language Embedding Space
Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagram with theoretical error bounds using graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on the CIFAR-100, and stronger multilingual retrieval performance on the xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
comment: 24 pages, 5 figures, 19 tables
☆ Transfer Learning with Distance Covariance for Random Forest: Error Bounds and an EHR Application
Random forest is an important method for ML applications due to its broad outperformance over competing methods for structured tabular data. We propose a method for transfer learning in nonparametric regression using a centered random forest (CRF) with distance covariance-based feature weights, assuming the unknown source and target regression functions are different for a few features (sparsely different). Our method first obtains residuals from predicting the response in the target domain using a source domain-trained CRF. Then, we fit another CRF to the residuals, but with feature splitting probabilities proportional to the sample distance covariance between the features and the residuals in an independent sample. We derive an upper bound on the mean square error rate of the procedure as a function of sample sizes and difference dimension, theoretically demonstrating transfer learning benefits in random forests. In simulations, we show that the results obtained for the CRFs also hold numerically for the standard random forest (SRF) method with data-driven feature split selection. Beyond transfer learning, our results also show the benefit of distance-covariance-based weights on the performance of RF in some situations. Our method shows significant gains in predicting the mortality of ICU patients in smaller-bed target hospitals using a large multi-hospital dataset of electronic health records for 200,000 ICU patients.
Quantifying Dataset Similarity to Guide Transfer Learning
Transfer learning has become a cornerstone of modern machine learning, as it can empower models by leveraging knowledge from related domains to improve learning effectiveness. However, transferring from poorly aligned data can harm rather than help performance, making it crucial to determine whether the transfer will be beneficial before implementation. This work aims to address this challenge by proposing an innovative metric to measure dataset similarity and provide quantitative guidance on transferability. In the literature, existing methods largely focus on feature distributions while overlooking label information and predictive relationships, potentially missing critical transferability insights. In contrast, our proposed metric, the Cross-Learning Score (CLS), measures dataset similarity through bidirectional generalization performance between domains. We provide a theoretical justification for CLS by establishing its connection to the cosine similarity between the decision boundaries for the target and source datasets. Computationally, CLS is efficient and fast to compute as it bypasses the problem of expensive distribution estimation for high-dimensional problems. We further introduce a general framework that categorizes source datasets into positive, ambiguous, or negative transfer zones based on their CLS relative to the baseline error, enabling informed decisions. Additionally, we extend this approach to encoder-head architectures in deep learning to better reflect modern transfer pipelines. Extensive experiments on diverse synthetic and real-world tasks demonstrate that CLS can reliably predict whether transfer will improve or degrade performance, offering a principled tool for guiding data selection in transfer learning.
☆ HeroFilter: Adaptive Spectral Graph Filter for Varying Heterophilic Relations
Graph heterophily, where connected nodes have different labels, has attracted significant interest recently. Most existing works adopt a simplified approach - using low-pass filters for homophilic graphs and high-pass filters for heterophilic graphs. However, we discover that the relationship between graph heterophily and spectral filters is more complex - the optimal filter response varies across frequency components and does not follow a strict monotonic correlation with heterophily degree. This finding challenges conventional fixed filter designs and suggests the need for adaptive filtering to preserve expressiveness in graph embeddings. Formally, natural questions arise: Given a heterophilic graph G, how and to what extent will the varying heterophily degree of G affect the performance of GNNs? How can we design adaptive filters to fit those varying heterophilic connections? Our theoretical analysis reveals that the average frequency response of GNNs and graph heterophily degree do not follow a strict monotonic correlation, necessitating adaptive graph filters to guarantee good generalization performance. Hence, we propose [METHOD NAME], a simple yet powerful GNN, which extracts information across the heterophily spectrum and combines salient representations through adaptive mixing. [METHOD NAME]'s superior performance achieves up to 9.2% accuracy improvement over leading baselines across homophilic and heterophilic graphs.
☆ A Joint Learning Approach to Hardware Caching and Prefetching NeurIPS 2025
Several learned policies have been proposed to replace heuristics for scheduling, caching, and other system components in modern systems. By leveraging diverse features, learning from historical trends, and predicting future behaviors, such models promise to keep pace with ever-increasing workload dynamism and continuous hardware evolution. However, policies trained in isolation may still achieve suboptimal performance when placed together. In this paper, we inspect one such instance in the domain of hardware caching -- for the policies of cache replacement and prefetching. We argue that these two policies are bidirectionally interdependent and make the case for training the two jointly. We propose a joint learning approach based on developing shared representations for the features used by the two policies. We present two approaches to develop these shared representations, one based on a joint encoder and another based on contrastive learning of the embeddings, and demonstrate promising preliminary results for both of these. Finally, we lay down an agenda for future research in this direction.
comment: Accepted at ML for Systems Workshop at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval
Text--image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text--image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., ``a gift for a mother who loves gardening''). In contrast, state-of-the-art vision--language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text--image retrieval.
☆ Statistical Guarantees for High-Dimensional Stochastic Gradient Descent NeurIPS 2025
Stochastic Gradient Descent (SGD) and its Ruppert-Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings are rarely understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing asymptotic stationarity of the iterates. Building on this, we derive the $q$-th moment convergence of SGD and ASGD for any $q\ge2$ in general $\ell^s$-norms, and, in particular, the $\ell^{\infty}$-norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide sharp high-probability concentration analysis which entails the probabilistic bound of high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.
comment: Accepted to NeurIPS 2025
☆ UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces U}nified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.
☆ Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives
Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight both into the geometry of nonlinear discretizations as well as the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.
☆ Learning by Steering the Neural Dynamics: A Statistical Mechanics Perspective
Despite the striking successes of deep neural networks trained with gradient-based optimization, these methods differ fundamentally from their biological counterparts. This gap raises key questions about how nature achieves robust, sample-efficient learning at minimal energy costs and solves the credit-assignment problem without backpropagation. We take a step toward bridging contemporary AI and computational neuroscience by studying how neural dynamics can support fully local, distributed learning that scales to simple machine-learning benchmarks. Using tools from statistical mechanics, we identify conditions for the emergence of robust dynamical attractors in random asymmetric recurrent networks. We derive a closed-form expression for the number of fixed points as a function of self-coupling strength, and we reveal a phase transition in their structure: below a critical self-coupling, isolated fixed points coexist with exponentially many narrow clusters showing the overlap-gap property; above it, subdominant yet dense and extensive clusters appear. These fixed points become accessible, including to a simple asynchronous dynamical rule, after an algorithm-dependent self-coupling threshold. Building on this analysis, we propose a biologically plausible algorithm for supervised learning with any binary recurrent network. Inputs are mapped to fixed points of the dynamics, by relaxing under transient external stimuli and stabilizing the resulting configurations via local plasticity. We show that our algorithm can learn an entangled version of MNIST, leverages depth to develop hierarchical representations and increase hetero-association capacity, and is applicable to several architectures. Finally, we highlight the strong connection between algorithm performance and the unveiled phase transition, and we suggest a cortex-inspired alternative to self-couplings for its emergence.
☆ Learning Dynamics of VLM Finetuning
Preference-based finetuning of vision--language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as \textbf{learning-dynamics--aware optimization} and introduce \textbf{Cooling-Weighted DPO (CW-DPO)}, a two-stage recipe that explicitly models and exploits the training trajectory. \textbf{Stage 1} performs supervised finetuning with \textbf{gentle negatives}: \textbf{low-weight smoothed supervision} that regularizes the base policy and curbs overconfidence without explicit penalties. \textbf{Stage 2} applies a DPO objective in which the \textbf{negative term is scaled by a cooling weight} computed from the model's \textbf{average token log-probability} on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize \textbf{on-policy negatives} and allow \textbf{mixed negatives} by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta\!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields \textbf{more stable optimization}, \textbf{better calibration}, and \textbf{higher pairwise win-rates} than SFT-only and vanilla DPO, while \textbf{converging in fewer steps}. Ablations isolate the \textbf{cooling-weight mechanism} as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that \textbf{smoothing learning dynamics before cooling preferences} is a simple, general principle for robust VLM alignment.
☆ Scaling Long-Horizon LLM Agent via Context-Folding
Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10$\times$ smaller and significantly outperforms models that rely on summarization-based context management.
☆ QLENS: Towards A Quantum Perspective of Language Transformers
In natural language processing, current methods for understanding Transformers are successful at identifying intermediate predictions during a model's inference. However, these approaches function as limited diagnostic checkpoints, lacking a mathematical framework for mechanistically modeling how each layer facilitates transitions between these evolving states. This interpretability gap and past successes of interdisciplinary outlooks inspire us to turn to physics in search of a descriptive mathematical framework for Transformers. We observe that language models are intrinsically probabilistic, an attribute that is echoed in the core postulates of quantum mechanics. This parallel inspires us to translate insights from this discipline to that of natural language processing. Towards this objective, we propose QLENS a novel attempt to develop a physics-based perspective on the Transformer generation process. Under QLENS, a Transformer is studied by converting its latent activations into a state vector in a Hilbert space derived from the model's output units. This state subsequently evolves through hidden layers - reformulated as unitary operators and analogously defined Hamiltonians - during inference. The model's final probability distribution is obtained by applying the Born rule to the end state using a specific measurement operator. To demonstrate QLENS's potential, we conduct a proof-of-concept by probing a toy Transformer to investigate the influence of individual layers in a model's prediction trajectory. We present our work as a foundation for cross-domain insights to be leveraged towards a broader understanding of Transformers.
☆ MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics ICCV 2025
Diffusion models are renowned for their generative capabilities, yet their pretraining processes exhibit distinct phases of learning speed that have been entirely overlooked in prior post-training acceleration efforts in the community. In this study, we introduce a novel framework called MosaicDiff that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model's inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant speed-ups in sampling without compromising output quality, outperforming previous state-of-the-art methods by large margins, also providing a new viewpoint for more efficient and robust training-free diffusion acceleration.
comment: International Conference on Computer Vision, ICCV 2025
☆ Y-shaped Generative Flows
Modern continuous-time generative models often induce V-shaped transport: each sample travels independently along nearly straight trajectories from prior to data, overlooking shared structure. We introduce Y-shaped generative flows, which move probability mass together along shared pathways before branching to target-specific endpoints. Our formulation is based on novel velocity-powered transport cost with a sublinear exponent (between zero and one). this concave dependence rewards joint and fast mass movement. Practically, we instantiate the idea in a scalable neural ODE training objective. On synthetic, image, and biology datasets, Y-flows recover hierarchy-aware structure, improve distributional metrics over strong flow-based baselines, and reach targets with fewer integration steps.
☆ Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors
Learning disentangled representations, where distinct factors of variation are captured by independent latent variables, is a central goal in machine learning. The dominant approach has been the Variational Autoencoder (VAE) framework, which uses a Kullback-Leibler (KL) divergence penalty to encourage the latent space to match a factorized Gaussian prior. In this work, however, we provide direct evidence that this KL-based regularizer is an unreliable mechanism, consistently failing to enforce the target distribution on the aggregate posterior. We validate this and quantify the resulting entanglement using our novel, unsupervised Latent Predictability Score (LPS). To address this failure, we introduce the Programmable Prior Framework, a method built on the Maximum Mean Discrepancy (MMD). Our framework allows practitioners to explicitly sculpt the latent space, achieving state-of-the-art mutual independence on complex datasets like CIFAR-10 and Tiny ImageNet without the common reconstruction trade-off. Furthermore, we demonstrate how this programmability can be used to engineer sophisticated priors that improve alignment with semantically meaningful features. Ultimately, our work provides a foundational tool for representation engineering, opening new avenues for model identifiability and causal reasoning.
☆ On efficiently computable functions, deep networks and sparse compositionality
We show that \emph{efficient Turing computability} at any fixed input/output precision implies the existence of \emph{compositionally sparse} (bounded-fan-in, polynomial-size) DAG representations and of corresponding neural approximants achieving the target precision. Concretely: if $f:[0,1]^d\to\R^m$ is computable in time polynomial in the bit-depths, then for every pair of precisions $(n,m_{\mathrm{out}})$ there exists a bounded-fan-in Boolean circuit of size and depth $\poly(n+m_{\mathrm{out}})$ computing the discretized map; replacing each gate by a constant-size neural emulator yields a deep network of size/depth $\poly(n+m_{\mathrm{out}})$ that achieves accuracy $\varepsilon=2^{-m_{\mathrm{out}}}$. We also relate these constructions to compositional approximation rates \cite{MhaskarPoggio2016b,poggio_deep_shallow_2017,Poggio2017,Poggio2023HowDS} and to optimization viewed as hierarchical search over sparse structures.
Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning NeurIPS 2025
In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of Mao et al. (2022)'s RestartQ-UCB algorithm: (1) complete forgetting, where all the information learned about an environment is lost after a restart, and (2) scheduled restarts, in which restarts occur only at predefined timings, regardless of the incompatibility of the policy with the current environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We find near-optimal empirical performance in multiple different environments, decreasing dynamic regret by up to $91$% relative to RestartQ-UCB.
comment: This paper contains 19 pages and 3 figures. To be presented at the 2nd Workshop on Aligning Reinforcement Learning Experimentalists and Theorists (ARLET 2025) at NeurIPS 2025
☆ Indoor Localization using Compact, Telemetry-Agnostic, Transfer-Learning Enabled Decoder-Only Transformer
Indoor Wi-Fi positioning remains a challenging problem due to the high sensitivity of radio signals to environmental dynamics, channel propagation characteristics, and hardware heterogeneity. Conventional fingerprinting and model-based approaches typically require labor-intensive calibration and suffer rapid performance degradation when devices, channel or deployment conditions change. In this paper, we introduce Locaris, a decoder-only large language model (LLM) for indoor localization. Locaris treats each access point (AP) measurement as a token, enabling the ingestion of raw Wi-Fi telemetry without pre-processing. By fine-tuning its LLM on different Wi-Fi datasets, Locaris learns a lightweight and generalizable mapping from raw signals directly to device location. Our experimental study comparing Locaris with state-of-the-art methods consistently shows that Locaris matches or surpasses existing techniques for various types of telemetry. Our results demonstrate that compact LLMs can serve as calibration-free regression models for indoor localization, offering scalable and robust cross-environment performance in heterogeneous Wi-Fi deployments. Few-shot adaptation experiments, using only a handful of calibration points per device, further show that Locaris maintains high accuracy when applied to previously unseen devices and deployment scenarios. This yields sub-meter accuracy with just a few hundred samples, robust performance under missing APs and supports any and all available telemetry. Our findings highlight the practical viability of Locaris for indoor positioning in the real-world scenarios, particularly in large-scale deployments where extensive calibration is infeasible.
comment: 11 pages, 12 Figures
☆ Enhancing Diffusion-Based Sampling with Molecular Collective Variables
Diffusion-based samplers learn to sample complex, high-dimensional distributions using energies or log densities alone, without training data. Yet, they remain impractical for molecular sampling because they are often slower than molecular dynamics and miss thermodynamically relevant modes. Inspired by enhanced sampling, we encourage exploration by introducing a sequential bias along bespoke, information-rich, low-dimensional projections of atomic coordinates known as collective variables (CVs). We introduce a repulsive potential centered on the CVs from recent samples, which pushes future samples towards novel CV regions and effectively increases the temperature in the projected space. Our resulting method improves efficiency, mode discovery, enables the estimation of free energy differences, and retains independent sampling from the approximate Boltzmann distribution via reweighting by the bias. On standard peptide conformational sampling benchmarks, the method recovers diverse conformational states and accurate free energy profiles. We are the first to demonstrate reactive sampling using a diffusion-based sampler, capturing bond breaking and formation with universal interatomic potentials at near-first-principles accuracy. The approach resolves reactive energy landscapes at a fraction of the wall-clock time of standard sampling methods, advancing diffusion-based sampling towards practical use in molecular sciences.
☆ Variational Mixture of Graph Neural Experts for Alzheimer's Disease Biomarker Recognition in EEG Brain Networks
Dementia disorders such as Alzheimer's disease (AD) and frontotemporal dementia (FTD) exhibit overlapping electrophysiological signatures in EEG that challenge accurate diagnosis. Existing EEG-based methods are limited by full-band frequency analysis that hinders precise differentiation of dementia subtypes and severity stages. We propose a variational mixture of graph neural experts (VMoGE) that integrates frequency-specific biomarker identification with structured variational inference for enhanced dementia diagnosis and staging. VMoGE employs a multi-granularity transformer to extract multi-scale temporal patterns across four frequency bands, followed by a variational graph convolutional encoder using Gaussian Markov Random Field priors. Through structured variational inference and adaptive gating, VMoGE links neural specialization to physiologically meaningful EEG frequency bands. Evaluated on two diverse datasets for both subtype classification and severity staging, VMoGE achieves superior performance with AUC improvements of +4% to +10% over state-of-the-art methods. Moreover, VMoGE provides interpretable insights through expert weights that correlate with clinical indicators and spatial patterns aligned with neuropathological signatures, facilitating EEG biomarker discovery for comprehensive dementia diagnosis and monitoring.
☆ Simplifying Optimal Transport through Schatten-$p$ Regularization
We propose a new general framework for recovering low-rank structure in optimal transport using Schatten-$p$ norm regularization. Our approach extends existing methods that promote sparse and interpretable transport maps or plans, while providing a unified and principled family of convex programs that encourage low-dimensional structure. The convexity of our formulation enables direct theoretical analysis: we derive optimality conditions and prove recovery guarantees for low-rank couplings and barycentric maps in simplified settings. To efficiently solve the proposed program, we develop a mirror descent algorithm with convergence guarantees for $p \geq 1$. Experiments on synthetic and real data demonstrate the method's efficiency, scalability, and ability to recover low-rank transport structures.
comment: 26 pages, 4 figures
☆ LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance
For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings -- often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness -- i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement's exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.
☆ Integrating Sequential and Relational Modeling for User Events: Datasets and Prediction Tasks
User event modeling plays a central role in many machine learning applications, with use cases spanning e-commerce, social media, finance, cybersecurity, and other domains. User events can be broadly categorized into personal events, which involve individual actions, and relational events, which involve interactions between two users. These two types of events are typically modeled separately, using sequence-based methods for personal events and graph-based methods for relational events. Despite the need to capture both event types in real-world systems, prior work has rarely considered them together. This is often due to the convenient simplification that user behavior can be adequately represented by a single formalization, either as a sequence or a graph. To address this gap, there is a need for public datasets and prediction tasks that explicitly incorporate both personal and relational events. In this work, we introduce a collection of such datasets, propose a unified formalization, and empirically show that models benefit from incorporating both event types. Our results also indicate that current methods leave a notable room for improvements. We release these resources to support further research in unified user event modeling and encourage progress in this direction.
☆ ADARL: Adaptive Low-Rank Structures for Robust Policy Learning under Uncertainty
Robust reinforcement learning (Robust RL) seeks to handle epistemic uncertainty in environment dynamics, but existing approaches often rely on nested min--max optimization, which is computationally expensive and yields overly conservative policies. We propose \textbf{Adaptive Rank Representation (AdaRL)}, a bi-level optimization framework that improves robustness by aligning policy complexity with the intrinsic dimension of the task. At the lower level, AdaRL performs policy optimization under fixed-rank constraints with dynamics sampled from a Wasserstein ball around a centroid model. At the upper level, it adaptively adjusts the rank to balance the bias--variance trade-off, projecting policy parameters onto a low-rank manifold. This design avoids solving adversarial worst-case dynamics while ensuring robustness without over-parameterization. Empirical results on MuJoCo continuous control benchmarks demonstrate that AdaRL not only consistently outperforms fixed-rank baselines (e.g., SAC) and state-of-the-art robust RL methods (e.g., RNAC, Parseval), but also converges toward the intrinsic rank of the underlying tasks. These results highlight that adaptive low-rank policy representations provide an efficient and principled alternative for robust RL under model uncertainty.
☆ High-Probability Bounds For Heterogeneous Local Differential Privacy
We study statistical estimation under local differential privacy (LDP) when users may hold heterogeneous privacy levels and accuracy must be guaranteed with high probability. Departing from the common in-expectation analyses, and for one-dimensional and multi-dimensional mean estimation problems, we develop finite sample upper bounds in $\ell_2$-norm that hold with probability at least $1-\beta$. We complement these results with matching minimax lower bounds, establishing the optimality (up to constants) of our guarantees in the heterogeneous LDP regime. We further study distribution learning in $\ell_\infty$-distance, designing an algorithm with high-probability guarantees under heterogeneous privacy demands. Our techniques offer principled guidance for designing mechanisms in settings with user-specific privacy levels.
☆ Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling NeurIPS 2025
The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to solve sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policy that are simultaneously less exploitable (adversarially robust) and conservative to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.
comment: Accepted by Reliable ML Workshop @ NeurIPS 2025
☆ Active Subspaces in Infinite Dimension
Active subspace analysis uses the leading eigenspace of the gradient's second moment to conduct supervised dimension reduction. In this article, we extend this methodology to real-valued functionals on Hilbert space. We define an operator which coincides with the active subspace matrix when applied to a Euclidean space. We show that many of the desirable properties of Active Subspace analysis extend directly to the infinite dimensional setting. We also propose a Monte Carlo procedure and discuss its convergence properties. Finally, we deploy this methodology to create visualizations and improve modeling and optimization on complex test problems.
☆ Improving Knowledge Graph Embeddings through Contrastive Learning with Negative Statements
Knowledge graphs represent information as structured triples and serve as the backbone for a wide range of applications, including question answering, link prediction, and recommendation systems. A prominent line of research for exploring knowledge graphs involves graph embedding methods, where entities and relations are represented in low-dimensional vector spaces that capture underlying semantics and structure. However, most existing methods rely on assumptions such as the Closed World Assumption or Local Closed World Assumption, treating missing triples as false. This contrasts with the Open World Assumption underlying many real-world knowledge graphs. Furthermore, while explicitly stated negative statements can help distinguish between false and unknown triples, they are rarely included in knowledge graphs and are often overlooked during embedding training. In this work, we introduce a novel approach that integrates explicitly declared negative statements into the knowledge embedding learning process. Our approach employs a dual-model architecture, where two embedding models are trained in parallel, one on positive statements and the other on negative statements. During training, each model generates negative samples by corrupting positive samples and selecting the most likely candidates as scored by the other model. The proposed approach is evaluated on both general-purpose and domain-specific knowledge graphs, with a focus on link prediction and triple classification tasks. The extensive experiments demonstrate that our approach improves predictive performance over state-of-the-art embedding models, demonstrating the value of integrating meaningful negative knowledge into embedding learning.
comment: Accepted at the Thirteenth International Conference on Knowledge Capture (K-CAP 2025)
☆ Actor-Enriched Time Series Forecasting of Process Performance
Predictive Process Monitoring (PPM) is a key task in Process Mining that aims to predict future behavior, outcomes, or performance indicators. Accurate prediction of the latter is critical for proactive decision-making. Given that processes are often resource-driven, understanding and incorporating actor behavior in forecasting is crucial. Although existing research has incorporated aspects of actor behavior, its role as a time-varying signal in PPM remains limited. This study investigates whether incorporating actor behavior information, modeled as time series, can improve the predictive performance of throughput time (TT) forecasting models. Using real-life event logs, we construct multivariate time series that include TT alongside actor-centric features, i.e., actor involvement, the frequency of continuation, interruption, and handover behaviors, and the duration of these behaviors. We train and compare several models to study the benefits of adding actor behavior. The results show that actor-enriched models consistently outperform baseline models, which only include TT features, in terms of RMSE, MAE, and R2. These findings demonstrate that modeling actor behavior over time and incorporating this information into forecasting models enhances performance indicator predictions.
comment: Accepted at ICPM 2025
☆ Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection ICDM
Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs - BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL - on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. Furthermore, we evaluate the models' capabilities in generating explanations to sarcastic instances. We evaluate the capabilities of VLMs on three benchmark sarcasm datasets (Muse, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model's performance in detecting sarcastic image-caption pairs, and (2) to assess their ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific finetuning.
comment: Accepted to ICDMW 2025 Workshop on Multimodal AI (MMAI). Full workshop info: https://icdmw25mmai.github.io/
☆ Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities NeurIPS 2025
Adapting language models to new tasks through continued pretraining faces a fundamental trade-off: models must learn new capabilities while avoiding catastrophic forgetting of existing knowledge. While prior work has studied synthetic data generation techniques, the optimal replay ratios for balancing task performance and knowledge retention under computational constraints remain poorly understood. We present a comprehensive empirical study investigating the interplay between replay ratio configuration and computational budget when adapting language models to new tasks. Using the bAbI reasoning tasks as our target objective, we apply synthetic data generation and systematically evaluate different total token budgets and replay ratio configurations. We analyze their effects on both task mastery and general knowledge retention. Our experiments reveal an optimal configuration that balances task-specific performance with general knowledge retention. Based on our findings, we provide empirically-grounded guidelines for selecting replay ratios based on computational budget, enabling practitioners to achieve strong task adaptation with significantly reduced training costs.
comment: Presented at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Continual and Compatible Foundation Model Updates (CCFM)
☆ WaveletDiff: Multilevel Wavelet Diffusion For Time Series Generation
Time series are ubiquitous in many applications that involve forecasting, classification and causal inference tasks, such as healthcare, finance, audio signal processing and climate sciences. Still, large, high-quality time series datasets remain scarce. Synthetic generation can address this limitation; however, current models confined either to the time or frequency domains struggle to reproduce the inherently multi-scaled structure of real-world time series. We introduce WaveletDiff, a novel framework that trains diffusion models directly on wavelet coefficients to exploit the inherent multi-resolution structure of time series data. The model combines dedicated transformers for each decomposition level with cross-level attention mechanisms that enable selective information exchange between temporal and frequency scales through adaptive gating. It also incorporates energy preservation constraints for individual levels based on Parseval's theorem to preserve spectral fidelity throughout the diffusion process. Comprehensive tests across six real-world datasets from energy, finance, and neuroscience domains demonstrate that WaveletDiff consistently outperforms state-of-the-art time-domain and frequency-domain generative methods on both short and long time series across five diverse performance metrics. For example, WaveletDiff achieves discriminative scores and Context-FID scores that are $3\times$ smaller on average than the second-best baseline across all datasets.
☆ Data or Language Supervision: What Makes CLIP Better than DINO?
CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
comment: EMNLP 2025 Findings
☆ Don't Walk the Line: Boundary Guidance for Filtered Generation
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
comment: 9 pages, 3 figures, 3 tables
☆ Z0-Inf: Zeroth Order Approximation for Data Influence
A critical aspect of analyzing and improving modern machine learning systems lies in understanding how individual training examples influence a model's predictive behavior. Estimating this influence enables critical applications, including data selection and model debugging; in particular, self-influence, which quantifies the influence of a training point on itself, has found many uses in data quality assessment and outlier detection. Existing methods for measuring data influence, however, are often impractical for large models due to low accuracy or prohibitive computational costs: most approaches either provide poor approximations or rely on gradients and inverse-Hessian computations that remain challenging to scale. In this work, we introduce a highly efficient zeroth-order approximation for estimating the influence of training data that requires only a fraction of the time and memory footprint of prior methods. Notably, our method relies solely on loss values of intermediate checkpoints on the training and test data, along with the checkpoints themselves, making it broadly applicable even when the loss function of interest is non-differentiable. Beyond its computational efficiency, our approach achieves superior accuracy in estimating self-influence and comparable or improved accuracy in estimating train-test influence for fine-tuned large language models, enabling scalable and practical analysis of how training data shapes model behavior.
☆ Schrödinger bridge for generative AI: Soft-constrained formulation and convergence analysis
Generative AI can be framed as the problem of learning a model that maps simple reference measures into complex data distributions, and it has recently found a strong connection to the classical theory of the Schr\"odinger bridge problems (SBPs) due partly to their common nature of interpolating between prescribed marginals via entropy-regularized stochastic dynamics. However, the classical SBP enforces hard terminal constraints, which often leads to instability in practical implementations, especially in high-dimensional or data-scarce regimes. To address this challenge, we follow the idea of the so-called soft-constrained Schr\"odinger bridge problem (SCSBP), in which the terminal constraint is replaced by a general penalty function. This relaxation leads to a more flexible stochastic control formulation of McKean-Vlasov type. We establish the existence of optimal solutions for all penalty levels and prove that, as the penalty grows, both the controls and value functions converge to those of the classical SBP at a linear rate. Our analysis builds on Doob's h-transform representations, the stability results of Schr\"odinger potentials, Gamma-convergence, and a novel fixed-point argument that couples an optimization problem over the space of measures with an auxiliary entropic optimal transport problem. These results not only provide the first quantitative convergence guarantees for soft-constrained bridges but also shed light on how penalty regularization enables robust generative modeling, fine-tuning, and transfer learning.
comment: 31 pages
☆ Combining Euclidean and Hyperbolic Representations for Node-level Anomaly Detection
Node-level anomaly detection (NAD) is challenging due to diverse structural patterns and feature distributions. As such, NAD is a critical task with several applications which range from fraud detection, cybersecurity, to recommendation systems. We introduce Janus, a framework that jointly leverages Euclidean and Hyperbolic Graph Neural Networks to capture complementary aspects of node representations. Each node is described by two views, composed by the original features and structural features derived from random walks and degrees, then embedded into Euclidean and Hyperbolic spaces. A multi Graph-Autoencoder framework, equipped with a contrastive learning objective as regularization term, aligns the embeddings across the Euclidean and Hyperbolic spaces, highlighting nodes whose views are difficult to reconcile and are thus likely anomalous. Experiments on four real-world datasets show that Janus consistently outperforms shallow and deep baselines, empirically demonstrating that combining multiple geometric representations provides a robust and effective approach for identifying subtle and complex anomalies in graphs.
☆ Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning NeurIPS 2025
In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions--a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also varies by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results available at https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark .
comment: 44 pages, 16 figures, NeurIPS 2025
☆ Enhancing the Quality of 3D Lunar Maps Using JAXA's Kaguya Imagery
As global efforts to explore the Moon intensify, the need for high-quality 3D lunar maps becomes increasingly critical-particularly for long-distance missions such as NASA's Endurance mission concept, in which a rover aims to traverse 2,000 km across the South Pole-Aitken basin. Kaguya TC (Terrain Camera) images, though globally available at 10 m/pixel, suffer from altitude inaccuracies caused by stereo matching errors and JPEG-based compression artifacts. This paper presents a method to improve the quality of 3D maps generated from Kaguya TC images, focusing on mitigating the effects of compression-induced noise in disparity maps. We analyze the compression behavior of Kaguya TC imagery, and identify systematic disparity noise patterns, especially in darker regions. In this paper, we propose an approach to enhance 3D map quality by reducing residual noise in disparity images derived from compressed images. Our experimental results show that the proposed approach effectively reduces elevation noise, enhancing the safety and reliability of terrain data for future lunar missions.
comment: Presented at IEEE SMC 2025
☆ On Thompson Sampling and Bilateral Uncertainty in Additive Bayesian Optimization
In Bayesian Optimization (BO), additive assumptions can mitigate the twin difficulties of modeling and searching a complex function in high dimension. However, common acquisition functions, like the Additive Lower Confidence Bound, ignore pairwise covariances between dimensions, which we'll call \textit{bilateral uncertainty} (BU), imposing a second layer of approximations. While theoretical results indicate that asymptotically not much is lost in doing so, little is known about the practical effects of this assumption in small budgets. In this article, we show that by exploiting conditional independence, Thompson Sampling respecting BU can be efficiently conducted. We use this fact to execute an empirical investigation into the loss incurred by ignoring BU, finding that the additive approximation to Thompson Sampling does indeed have, on balance, worse performance than the exact method, but that this difference is of little practical significance. This buttresses the theoretical understanding and suggests that the BU-ignoring approximation is sufficient for BO in practice, even in the non-asymptotic regime.
☆ Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending only on the smoothness $\beta$ of the activation, and crucially independent of token count, ambient dimension, or rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable and provide a theoretical understanding of the attention mechanism and its training.
☆ GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. GAR introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover's evolving capability. It thereby improves the training efficiency and enables stronger performance of proving advanced theorems. Experiments show that with GAR training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of 4.20% on MiniF2F-Test benchmark, while DeepSeek-Prover-V2's pass@32 on ProofNet-Test increases from 22.58% to 25.81%. Beyond formal proving, GAR establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments.
♻ ☆ On Convolutions, Intrinsic Dimension, and Diffusion Models
The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. Diffusion models (DMs) -- which operate by convolving data with progressively larger amounts of Gaussian noise and then learning to revert this process -- have risen to prominence as the most performant generative models, and are known to be able to learn distributions with low-dimensional support. For a given datum in one of these submanifolds, we should thus intuitively expect DMs to have implicitly learned its corresponding local intrinsic dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari et al. (2024b) recently showed that this is indeed the case by linking this LID to the rate of change of the log marginal densities of the DM with respect to the amount of added noise, resulting in an LID estimator known as FLIPD. LID estimators such as FLIPD have a plethora of uses, among others they quantify the complexity of a given datum, and can be used to detect outliers, adversarial examples and AI-generated text. FLIPD achieves state-of-the-art performance at LID estimation, yet its theoretical underpinnings are incomplete since Kamkari et al. (2024b) only proved its correctness under the highly unrealistic assumption of affine submanifolds. In this work we bridge this gap by formally proving the correctness of FLIPD under realistic assumptions. Additionally, we show that an analogous result holds when Gaussian convolutions are replaced with uniform ones, and discuss the relevance of this result.
comment: TMLR 2025 (expert certification)
♻ ☆ Causal Explanation of Concept Drift -- A Truly Actionable Approach ECML
In a world that constantly changes, it is crucial to understand how those changes impact different systems, such as industrial manufacturing or critical infrastructure. Explaining critical changes, referred to as concept drift in the field of machine learning, is the first step towards enabling targeted interventions to avoid or correct model failures, as well as malfunctions and errors in the physical world. Therefore, in this work, we extend model-based drift explanations towards causal explanations, which increases the actionability of the provided explanations. We evaluate our explanation strategy on a number of use cases, demonstrating the practical usefulness of our framework, which isolates the causally relevant features impacted by concept drift and, thus, allows for targeted intervention.
comment: This manuscript was presented at the TempXAI workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2025)
♻ ☆ Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression NeurIPS 2025
With the rise of the fine-tuned-pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 50x compression, (b) general NLP models (RoBERTa-base, T5-base) with up to 224x compression, (c) vision models (ViT-B/32, ViT-L/14) with up to 132x compression, and (d) multi-modal models (BEiT-3) with 18x compression, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression. Code is available at https://github.com/xiaohuiwang000/UltraDelta.
comment: Accepted at NeurIPS 2025
♻ ☆ Cost-aware Stopping for Bayesian Optimization
In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora's Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function usually matches or outperforms other acquisition-function--stopping-rule pairs in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.
♻ ☆ The Hidden Link Between RLHF and Contrastive Learning
Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be interpreted as methods that performing contrastive learning based on the positive and negative samples derived from base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). Such paradigm further illuminates why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on the perspective, we replace the DV/MINE bound with the Jensen-Shannon (JS) MI estimator and propose the Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks.
♻ ☆ Provably faster randomized and quantum algorithms for $k$-means clustering via uniform sampling
The $k$-means algorithm (Lloyd's algorithm) is a widely used method for clustering unlabeled data. A key bottleneck of the $k$-means algorithm is that each iteration requires time linear in the number of data points, which can be expensive in big data applications. This was improved in recent works proposing quantum and quantum-inspired classical algorithms to approximate the $k$-means algorithm locally, in time depending only logarithmically on the number of data points (along with data dependent parameters) [q-means: A quantum algorithm for unsupervised machine learning, Kerenidis, Landman, Luongo, and Prakash, NeurIPS 2019; Do you know what $q$-means?, Cornelissen, Doriguello, Luongo, Tang, QTML 2025]. In this work, we describe a simple randomized mini-batch $k$-means algorithm and a quantum algorithm inspired by the classical algorithm. We demonstrate that the worst case guarantees of these algorithms can significantly improve upon the bounds for algorithms in prior work. Our improvements are due to a careful use of uniform sampling, which preserves certain symmetries of the $k$-means problem that are not preserved in previous algorithms that use data norm-based sampling.
comment: Included a comparison with [Do you know what $q$-means?, Cornelissen, Doriguello, Luongo, Tang, QTML 2025]; added numerical experiments for the classical algorithms; updated theory and fixed typos
♻ ☆ Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence. We first propose a holistic taxonomy of spatial tasks that unifies existing benchmarks and a standardized protocol for the fair evaluation of state-of-the-art proprietary and open-source models across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail even the most advanced multimodal models.
♻ ☆ Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot
In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as \texttt{Qwen2.5-Max} and \texttt{DeepSeek-R1}. Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.
comment: EMNLP25-findings camera_ready, 19 pages,22 figures
♻ ☆ The Minimal Search Space for Conditional Causal Bandits ICLR2026
Causal knowledge can be used to support decision-making problems. This has been recognized in the causal bandits literature, where a causal (multi-armed) bandit is characterized by a causal graphical model and a target variable. The arms are then interventions on the causal model, and rewards are samples of the target variable. Causal bandits were originally studied with a focus on hard interventions. We focus instead on cases where the arms are conditional interventions, which more accurately model many real-world decision-making problems by allowing the value of the intervened variable to be chosen based on the observed values of other variables. This paper presents a graphical characterization of the minimal set of nodes guaranteed to contain the optimal conditional intervention, which maximizes the expected reward. We then propose an efficient algorithm with a time complexity of $O(|V| + |E|)$ to identify this minimal set of nodes. We prove that the graphical characterization and the proposed algorithm are correct. Finally, we empirically demonstrate that our algorithm significantly prunes the search space and substantially accelerates convergence rates when integrated into standard multi-armed bandit algorithms.
comment: Submitted to ICLR2026
♻ ☆ MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction NeurIPS 2025
We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/.
comment: Accepted by NeurIPS 2025
♻ ☆ Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
♻ ☆ Structured Kolmogorov-Arnold Neural ODEs for Interpretable Learning and Symbolic Discovery of Nonlinear Dynamics
Understanding and modeling nonlinear dynamical systems is a fundamental challenge across science and engineering. Deep learning has shown remarkable potential for capturing complex system behavior, yet achieving models that are both accurate and physically interpretable remains difficult. To address this, we propose Structured Kolmogorov-Arnold Neural ODEs (SKANODEs), a framework that integrates structured state-space modeling with Kolmogorov-Arnold Networks (KANs). Within a Neural ODE architecture, SKANODE employs a fully trainable KAN as a universal function approximator to perform virtual sensing, recovering latent states that correspond to interpretable physical quantities such as displacements and velocities. Leveraging KAN's symbolic regression capability, SKANODE then extracts compact, interpretable expressions for the system's governing dynamics. Extensive experiments on simulated and real-world systems demonstrate that SKANODE achieves superior predictive accuracy, discovers physics-consistent dynamics, and reveals complex nonlinear behavior. Notably, it identifies hysteretic behavior in an F-16 aircraft and recovers a concise symbolic equation describing this phenomenon. SKANODE thus enables interpretable, data-driven discovery of physically grounded models for complex nonlinear dynamical systems.
♻ ☆ Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm NeurIPS 2025
Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often lead to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery, policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.
comment: Accepted to NeurIPS 2025. Title used at submission and review: PIRO: Toward Stable Reward Learning for Inverse RL via Monotonic Policy Divergence Reduction
♻ ☆ ORN-CBF: Learning Observation-conditioned Residual Neural Control Barrier Functions via Hypernetworks
Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.
♻ ☆ Automated Machine Learning for Unsupervised Tabular Tasks
In this work, we present LOTUS (Learning to Learn with Optimal Transport for Unsupervised Scenarios), a simple yet effective method to perform model selection for multiple unsupervised machine learning(ML) tasks such as outlier detection and clustering. Our intuition behind this work is that a machine learning pipeline will perform well in a new dataset if it previously worked well on datasets with a similar underlying data distribution. We use Optimal Transport distances to find this similarity between unlabeled tabular datasets and recommend machine learning pipelines with one unified single method on two downstream unsupervised tasks: outlier detection and clustering. We present the effectiveness of our approach with experiments against strong baselines and show that LOTUS is a very promising first step toward model selection for multiple unsupervised ML tasks.
comment: Accepted at Machine Learning Journal, 2025
♻ ☆ Query Complexity of Classical and Quantum Channel Discrimination
Quantum channel discrimination has been studied from an information-theoretic perspective, wherein one is interested in the optimal decay rate of error probabilities as a function of the number of unknown channel accesses. In this paper, we study the query complexity of quantum channel discrimination, wherein the goal is to determine the minimum number of channel uses needed to reach a desired error probability. To this end, we show that the query complexity of binary channel discrimination depends logarithmically on the inverse error probability and inversely on the negative logarithm of the (geometric and Holevo) channel fidelity. As a special case of these findings, we precisely characterize the query complexity of discriminating two classical channels and two classical-quantum channels. Furthermore, by obtaining an optimal characterization of the sample complexity of quantum hypothesis testing, including prior probabilities, we provide a more precise characterization of query complexity when the error probability does not exceed a fixed threshold. We also provide lower and upper bounds on the query complexity of binary asymmetric channel discrimination and multiple quantum channel discrimination. For the former, the query complexity depends on the geometric R\'enyi and Petz R\'enyi channel divergences, while for the latter, it depends on the negative logarithm of the (geometric and Uhlmann) channel fidelity. For multiple channel discrimination, the upper bound scales as the logarithm of the number of channels.
comment: v3: 35 pages, published version; v2: 33 pages, Added tighter and precise characterization of sample and query complexity in Theorem 11 (for states), Theorem 12 (for general channels), and Corollaries 10 and 14 for classical-quantum channels; v1:22 pages; see also the independent work "Sampling complexity of quantum channel discrimination" DOI 10.1088/1572-9494/adcb9e
♻ ☆ Learning Satellite Attitude Dynamics with Physics-Informed Normalising Flow
Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that the inclusion of physics-based information significantly enhances the performance in terms of the mean relative error of the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, yielding improvements of up to 42.86% in performance stability error and increased robustness-to-noise.
♻ ☆ Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study
Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and developing scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze limited techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we present a comprehensive evaluation framework aimed at English texts to assess the ability of LLMs in detecting demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task using a demographic-focused taxonomy. We then conduct a systematic evaluation with models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and multi-demographic targeted biases, underscoring the need for more effective and scalable auditing frameworks.
comment: 18 pages
♻ ☆ Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.
♻ ☆ Online Selective Generation with Adversarial Bandit Feedback
Large language generative models increasingly interact with humans, while their falsified responses raise concerns. To mitigate this hallucination effect, selectively abstaining from answering, called selective generation, provides an effective way for generators to control the hallucination when uncertain about their answers. However, as selective generators interact under adversarial environments and receive partial feedback from users on selected generation (e.g., thumbs up or down on the selected answer), learning methods for selective generation under such practical setups are crucial but currently missing. To address this limitation, we propose an online learning algorithm for selective generation with partial feedback under an adaptive adversary. In particular, we re-purpose an adversarial bandit algorithm to design an online selective generation method with controllable false discovery rates (FDR), which measures the rate of hallucination. The key building blocks include a novel conversion lemma from regret of any bandit algorithm to the FDR, and the exploitation of a unique structure of selective generation to reuse partial feedback, which we call feedback unlocking. We empirically evaluate the efficacy of the proposed online selective generation algorithm with partial feedback over diverse learning environments, demonstrating its ability to control the FDR, while maintaining reasonable selection efficiency, i.e., the ratio of non-abstaining answers, compared to baselines.
comment: 10 pages
♻ ☆ Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain NeurIPS 2025
Tactile sensing remains far less understood in neuroscience and less effective in artificial systems compared to more mature modalities such as vision and language. We bridge these gaps by introducing a novel Encoder-Attender-Decoder (EAD) framework to systematically explore the space of task-optimized temporal neural networks trained on realistic tactile input sequences from a customized rodent whisker-array simulator. We identify convolutional recurrent neural networks (ConvRNNs) as superior encoders to purely feedforward and state-space architectures for tactile categorization. Crucially, these ConvRNN-encoder-based EAD models achieve neural representations closely matching rodent somatosensory cortex, saturating the explainable neural variability and revealing a clear linear relationship between supervised categorization performance and neural alignment. Furthermore, contrastive self-supervised ConvRNN-encoder-based EADs, trained with tactile-specific augmentations, match supervised neural fits, serving as an ethologically-relevant, label-free proxy. For neuroscience, our findings highlight nonlinear recurrent processing as important for general-purpose tactile representations in somatosensory cortex, providing the first quantitative characterization of the underlying inductive biases in this system. For embodied AI, our results emphasize the importance of recurrent EAD architectures to handle realistic tactile inputs, along with tailored self-supervised learning methods for achieving robust tactile perception with the same type of sensors animals use to sense in unstructured environments.
comment: 10 pages, 8 figures, 7 tables, NeurIPS 2025 Camera Ready Version (oral)
♻ ☆ Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization
Self-Refinement refers to a model's ability to revise its own responses to produce improved outputs. This capability can also serve as a fundamental mechanism for Self-Improvement, for example, by reconstructing datasets with refined results to enhance intrinsic model performance. However, our comprehensive experiments reveal that large language models (LLMs) show no clear evidence of inherent Self-Refinement and may even experience response quality degradation after Self-Refinement. To address this issue, we propose EVOLVE, a simple and effective framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We first explore optimization methods during training to activate the model's Self-Refinement capability. Then, at inference, we investigate various generation strategies to further enhance and utilize Self-Refinement while supplying the necessary data for training. Through synergistic optimization of training and inference stages, we continually evolve the model's Self-Refinement ability, enabling it to better refine its own responses. Moreover, we demonstrate the potential of leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic model abilities. Experiments show that the evolved Self-Refinement ability enables the Llama-3.1-8B base model to surpass GPT-4o, achieving 62.3% length-controlled and 63.3% raw win rates on AlpacaEval 2, and 50.3% on Arena-Hard. It also generalizes effectively to out-of-domain reasoning tasks, improving performance on mathematical reasoning benchmarks such as GSM8K and MATH.
♻ ☆ Personalized Bayesian Federated Learning with Wasserstein Barycenter Aggregation NIPS 2025
Personalized Bayesian federated learning (PBFL) handles non-i.i.d. client data and quantifies uncertainty by combining personalization with Bayesian inference. However, existing PBFL methods face two limitations: restrictive parametric assumptions in client posterior inference and naive parameter averaging for server aggregation. To overcome these issues, we propose FedWBA, a novel PBFL method that enhances both local inference and global aggregation. At the client level, we use particle-based variational inference for nonparametric posterior representation. At the server level, we introduce particle-based Wasserstein barycenter aggregation, offering a more geometrically meaningful approach. Theoretically, we provide local and global convergence guarantees for FedWBA. Locally, we prove a KL divergence decrease lower bound per iteration for variational inference convergence. Globally, we show that the Wasserstein barycenter converges to the true parameter as the client data size increases. Empirically, experiments show that FedWBA outperforms baselines in prediction accuracy, uncertainty calibration, and convergence rate, with ablation studies confirming its robustness.
comment: The paper has been accepted by NIPS 2025
♻ ☆ Re-uploading quantum data: A universal function approximator for quantum inputs
Quantum data re-uploading has proved powerful for classical inputs, where repeatedly encoding features into a small circuit yields universal function approximation. Extending this idea to quantum inputs remains underexplored, as the information contained in a quantum state is not directly accessible in classical form. We propose and analyze a quantum data re-uploading architecture in which a qubit interacts sequentially with fresh copies of an arbitrary input state. The circuit can approximate any bounded continuous function using only one ancilla qubit and single-qubit measurements. By alternating entangling unitaries with mid-circuit resets of the input register, the architecture realizes a discrete cascade of completely positive and trace-preserving maps, analogous to collision models in open quantum system dynamics. Our framework provides a qubit-efficient and expressive approach to designing quantum machine learning models that operate directly on quantum data.
comment: 24 pages, 11 figures
♻ ☆ Investigating Memory in RL with POPGym Arcade
How should we analyze memory in deep RL? We introduce mathematical tools for fairly analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atari-inspired, hardware-accelerated, pixel-based environments sharing a single observation and action space. Each environment provides fully and partially observable variants, enabling counterfactual studies on observability. We find that controlled studies are necessary for fair comparisons, and identify a pathology where value functions smear credit over irrelevant history. With this pathology, we demonstrate how out-of-distribution scenarios can contaminate memory, perturbing the policy far into the future, with implications for sim-to-real transfer and offline RL.
♻ ☆ Enhancing XAI Narratives through Multi-Narrative Refinement and Knowledge Distillation
Explainable Artificial Intelligence has become a crucial area of research, aiming to demystify the decision-making processes of deep learning models. Among various explainability techniques, counterfactual explanations have been proven particularly promising, as they offer insights into model behavior by highlighting minimal changes that would alter a prediction. Despite their potential, these explanations are often complex and technical, making them difficult for non-experts to interpret. To address this challenge, we propose a novel pipeline that leverages Language Models, large and small, to compose narratives for counterfactual explanations. We employ knowledge distillation techniques along with a refining mechanism to enable Small Language Models to perform comparably to their larger counterparts while maintaining robust reasoning abilities. In addition, we introduce a simple but effective evaluation method to assess natural language narratives, designed to verify whether the models' responses are in line with the factual, counterfactual ground truth. As a result, our proposed pipeline enhances both the reasoning capabilities and practical performance of student models, making them more suitable for real-world use cases.
♻ ☆ Multi-Modal Manipulation via Multi-Modal Policy Consensus
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental of new representations. We evaluate our approach on simulated manipulation tasks in {RLBench}, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We further conduct perturbation-based importance analysis, which reveals adaptive shifts between modalities.
comment: 9 pages, 7 figures. Project website: https://policyconsensus.github.io
♻ ☆ Failure Prediction at Runtime for Generative Robot Policies NeurIPS 2025
Imitation learning (IL) with generative models, such as diffusion and flow matching, has enabled robots to perform complex, long-horizon tasks. However, distribution shifts from unseen environments or compounding action errors can still cause unpredictable and unsafe behavior, leading to task failure. Early failure prediction during runtime is therefore essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a general framework for Failure Prediction at Runtime for generative IL policies that does not require failure data. FIPER identifies two key indicators of impending failure: (i) out-of-distribution (OOD) observations detected via random network distillation in the policy's embedding space, and (ii) high uncertainty in generated actions measured by a novel action-chunk entropy score. Both failure prediction scores are calibrated using a small set of successful rollouts via conformal prediction. A failure alarm is triggered when both indicators, aggregated over short time windows, exceed their thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data and videos are available at https://tum-lsy.github.io/fiper_website.
comment: Project page: https://tum-lsy.github.io/fiper_website. 33 pages, 12 figures. Accepted to NeurIPS 2025
♻ ☆ From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work have highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .
♻ ☆ The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models NeurIPS 2025
Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA researches generally suffer from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and make it difficult to assess their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP--a model trained with a Sigmoid loss--and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.
comment: NeurIPS 2025 Datasets and Benchmarks Track. Github link: https://github.com/TomSheng21/tta-vlm
♻ ☆ Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.
comment: 85 pages (11 + references and appendix), 9 figures. v2: added acknowledgements
♻ ☆ Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the 'grade' of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
♻ ☆ Train-before-Test Harmonizes Language Model Rankings
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
♻ ☆ PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
In recent years, unlearning techniques, which are methods for inducing a model to "forget" previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.
♻ ☆ Superior Molecular Representations from Intermediate Encoder Layers
Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we first analyze the information flow in five diverse molecular encoders and find that intermediate layers retain more general-purpose features, whereas the final-layer specializes and compresses information. We then perform an empirical layer-wise evaluation across 22 property prediction tasks. We find that using frozen embeddings from optimal intermediate layers improves downstream performance by an average of 5.4%, up to 28.6%, compared to the final-layer. Furthermore, finetuning encoders truncated at intermediate depths achieves even greater average improvements of 8.5%, with increases as high as 40.8%, obtaining new state-of-the-art results on several benchmarks. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code will be made publicly available.
♻ ☆ On Experiments
The scientific process is a means to turn the results of experiments into knowledge about the world in which we live. Much research effort has been directed toward automating this process. To do this, one needs to formulate the scientific process in a precise mathematical language. This paper outlines one such language. What is presented here is hardly new. The material is based on great thinkers from times past well as more modern contributions. The novel contributions of this paper are: A new general data processing inequality, a bias variance decomposition for canonical losses, streamlined proofs of the Blackwell-Sherman-Stein and Randomization theorems. means of calculating deficiency through linear programming.
♻ ☆ Robustness and Regularization in Hierarchical Re-Basin
This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.
comment: Published in 32th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2024
♻ ☆ Agentic large language models improve retrieval-based radiology question answering
Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose radiology Retrieval and Reasoning (RaR), a multi-step retrieval and reasoning framework designed to improve diagnostic accuracy, factual consistency, and clinical reliability of LLMs in radiology question answering. We evaluated 25 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. To assess generalizability, we additionally tested on an unseen internal dataset of 65 real-world radiology board examination questions. RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG. The greatest gains occurred in small-scale models, while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, RaR retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models showed gains from RaR (e.g., MedGemma-27B), indicating that retrieval remains beneficial despite embedded domain knowledge. These results highlight the potential of RaR to enhance factuality and diagnostic accuracy in radiology QA, warranting future studies to validate their clinical utility. All datasets, code, and the full RaR framework are publicly available to support open research and clinical translation.
♻ ☆ Edge Delayed Deep Deterministic Policy Gradient: efficient continuous control for edge scenarios
Deep Reinforcement Learning is gaining increasing attention thanks to its capability to learn complex policies in high-dimensional settings. Recent advancements utilize a dual-network architecture to learn optimal policies through the Q-learning algorithm. However, this approach has notable drawbacks, such as an overestimation bias that can disrupt the learning process and degrade the performance of the resulting policy. To address this, novel algorithms have been developed that mitigate overestimation bias by employing multiple Q-functions. Edge scenarios, which prioritize privacy, have recently gained prominence. In these settings, limited computational resources pose a significant challenge for complex Machine Learning approaches, making the efficiency of algorithms crucial for their performance. In this work, we introduce a novel Reinforcement Learning algorithm tailored for edge scenarios, called Edge Delayed Deep Deterministic Policy Gradient (EdgeD3). EdgeD3 enhances the Deep Deterministic Policy Gradient (DDPG) algorithm, achieving significantly improved performance with $25\%$ less Graphics Process Unit (GPU) time while maintaining the same memory usage. Additionally, EdgeD3 consistently matches or surpasses the performance of state-of-the-art methods across various benchmarks, all while using $30\%$ fewer computational resources and requiring $30\%$ less memory.
♻ ☆ Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50\% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
♻ ☆ Simple and Effective Specialized Representations for Fair Classifiers
Fair classification is a critical challenge that has gained increasing importance due to international regulations and its growing use in high-stakes decision-making settings. Existing methods often rely on adversarial learning or distribution matching across sensitive groups; however, adversarial learning can be unstable, and distribution matching can be computationally intensive. To address these limitations, we propose a novel approach based on the characteristic function distance. Our method ensures that the learned representation contains minimal sensitive information while maintaining high effectiveness for downstream tasks. By utilizing characteristic functions, we achieve a more stable and efficient solution compared to traditional methods. Additionally, we introduce a simple relaxation of the objective function that guarantees fairness in common classification models with no performance degradation. Experimental results on benchmark datasets demonstrate that our approach consistently matches or achieves better fairness and predictive accuracy than existing methods. Moreover, our method maintains robustness and computational efficiency, making it a practical solution for real-world applications.
♻ ☆ J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.
comment: 10 pages, 13 tables, 14 figures
♻ ☆ Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets
Achieving both accuracy and diverse reasoning remains challenging for Large Language Models (LLMs) in complex domains like mathematics. A key bottleneck is evaluating intermediate reasoning steps to guide generation without costly human annotations. To address this, we first introduce a novel Process Reward Model (PRM) trained automatically using Monte Carlo Tree Search coupled with a similarity-based data augmentation technique, effectively capturing step-level reasoning quality. Leveraging this PRM, we then adapt Generative Flow Networks (GFlowNets) to operate at the reasoning step level. Unlike traditional reinforcement learning focused on maximizing a single reward, GFlowNets naturally sample diverse, high-quality solutions proportional to their rewards, as measured by our PRM. Empirical evaluation shows strong improvements in both accuracy and solution diversity on challenging mathematical benchmarks (e.g., +2.59% absolute accuracy on MATH Level 5 for Llama3.2-3B), with effective generalization to unseen datasets (+9.4\% absolute on SAT MATH). Furthermore, we benchmark our PRM against existing open-source reward models, demonstrating superior alignment with reasoning quality and more consistent guidance for downstream generation. Our work demonstrates the potential of PRM-guided, step-level GFlowNets for developing more robust and versatile mathematical reasoning in LLMs.
♻ ☆ Incentivize Contribution and Learn Parameters Too: Federated Learning with Strategic Data Owners
Classical federated learning (FL) assumes that the clients have a limited amount of noisy data with which they voluntarily participate and contribute towards learning a global, more accurate model in a principled manner. The learning happens in a distributed fashion without sharing the data with the center. However, these methods do not consider the incentive of an agent for participating and contributing to the process, given that data collection and running a distributed algorithm is costly for the clients. The question of rationality of contribution has been asked recently in the literature and some results exist that consider this problem. This paper addresses the question of simultaneous parameter learning and incentivizing contribution in a truthful manner, which distinguishes it from the extant literature. Our first mechanism incentivizes each client to contribute to the FL process at a Nash equilibrium and simultaneously learn the model parameters. We also ensure that agents are incentivized to truthfully reveal information in the intermediate stages of the algorithm. However, this equilibrium outcome can be away from the optimal, where clients contribute with their full data and the algorithm learns the optimal parameters. We propose a second mechanism that enables the full data contribution along with optimal parameter learning. Large scale experiments with real (federated) datasets (CIFAR-10, FEMNIST, and Twitter) show that these algorithms converge quite fast in practice, yield good welfare guarantees and better model performance for all agents.
comment: 25 pages, under review
♻ ☆ TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models AAAI 2026
Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4\% vs 68.4\% baseline), cross-environment validation on SimplerEnv (4.8\% relative improvement), and 8.7\% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
comment: Manuscript submitted to AAAI 2026, currently under review
♻ ☆ Scaling Equilibrium Propagation to Deeper Neural Network Architectures
Equilibrium propagation has been proposed as a biologically plausible alternative to the backpropagation algorithm. The local nature of gradient computations, combined with the use of convergent RNNs to reach equilibrium states, make this approach well-suited for implementation on neuromorphic hardware. However, previous studies on equilibrium propagation have been restricted to networks containing only dense layers or relatively small architectures with a few convolutional layers followed by a final dense layer. These networks have a significant gap in accuracy compared to similarly sized feedforward networks trained with backpropagation. In this work, we introduce the Hopfield-Resnet architecture, which incorporates residual (or skip) connections in Hopfield networks with clipped $\mathrm{ReLU}$ as the activation function. The proposed architectural enhancements enable the training of networks with nearly twice the number of layers reported in prior works. For example, Hopfield-Resnet13 achieves 93.92\% accuracy on CIFAR-10, which is $\approx$3.5\% higher than the previous best result and comparable to that provided by Resnet13 trained using backpropagation.
♻ ☆ Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$\%$ on common benchmarks.
♻ ☆ Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling NeurIPS 2025
3D molecule generation is crucial for drug discovery and material science, requiring models to process complex multi-modalities, including atom types, chemical bonds, and 3D coordinates. A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose \textbf{U}nified Variational \textbf{A}uto-\textbf{E}ncoder for \textbf{3D} Molecular Latent Diffusion Modeling (\textbf{UAE-3D}), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer--a general-purpose diffusion model without any molecular inductive bias--for latent generation. Extensive experiments on GEOM-Drugs and QM9 datasets demonstrate that our method significantly establishes new benchmarks in both \textit{de novo} and conditional 3D molecule generation, achieving leading efficiency and quality. On GEOM-Drugs, it reduces FCD by 72.6\% over the previous best result, while achieving over 70\% relative average improvements in geometric fidelity. Our code is released at https://github.com/lyc0930/UAE-3D/.
comment: NeurIPS 2025
♻ ☆ Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size ACML2025
We theoretically analyzed the convergence behavior of Riemannian stochastic gradient descent (RSGD) and found that using an increasing batch size leads to faster convergence than using a constant batch size, not only with a constant learning rate but also with a decaying learning rate, such as cosine annealing decay and polynomial decay. The convergence rate improves from $O(T^{-1}+C)$ with a constant batch size to $O(T^{-1})$ with an increasing batch size, where $T$ denotes the total number of iterations and $C$ is a constant. Using principal component analysis and low-rank matrix completion, we investigated, both theoretically and numerically, how an increasing batch size affects computational time as quantified by stochastic first-order oracle (SFO) complexity. An increasing batch size was found to reduce the SFO complexity of RSGD. Furthermore, an increasing batch size was found to offer the advantages of both small and large constant batch sizes.
comment: Accepted at ACML2025
♻ ☆ Curl Descent: Non-Gradient Learning Dynamics with Sign-Diverse Plasticity
Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient "curl"-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through neurons exhibiting rule-flipped plasticity. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.
♻ ☆ Learning Shared Representations from Unpaired Data NeurIPS 2025
Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our project page: https://shaham-lab.github.io/SUE_page.
comment: Accepted to NeurIPS 2025
♻ ☆ Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
comment: The code and model are available at https://github.com/M-E-AGI-Lab/Muddit
♻ ☆ Enhancing Self-Supervised Learning with Semantic Pairs A New Dataset and Empirical Study
Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, encouraging the model to learn representations invariant to the common underlying object across these views. While this approach facilitates the acquisition of invariant representations for dataset instances under various handcrafted transformations (e.g., random cropping, colour jittering), an exclusive reliance on such data transformations for achieving invariance may inherently limit the model's generalizability to unseen datasets and diverse downstream tasks. The inherent limitation stems from the fact that the finite set of transformations within the data processing pipeline is unable to encompass the full spectrum of potential data variations. In this study, we provide the technical foundation for leveraging semantic pairs to enhance the generalizability of the model's representation and empirically demonstrate that incorporating semantic pairs mitigates the issue of limited transformation coverage. Specifically, we propose that by exposing the model to semantic pairs (i.e., two instances belonging to the same semantic category), we introduce varied real-world scene contexts, thereby fostering the development of more generalizable object representations. To validate this hypothesis, we constructed and released a novel dataset comprising curated semantic pairs and conducted extensive experimentation to empirically establish that their inclusion enables the model to learn more general representations, ultimately leading to improved performance across diverse downstream tasks.
comment: 16 pages, 7 figures, 5 tables
♻ ☆ TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference
Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
♻ ☆ Long-Range Graph Wavelet Networks NeurIPS 2025
Modeling long-range interactions, the propagation of information across distant parts of a graph, is a central challenge in graph machine learning. Graph wavelets, inspired by multi-resolution signal processing, provide a principled way to capture both local and global structures. However, existing wavelet-based graph neural networks rely on finite-order polynomial approximations, which limit their receptive fields and hinder long-range propagation. We propose Long-Range Graph Wavelet Networks (LR-GWN), which decompose wavelet filters into complementary local and global components. Local aggregation is handled with efficient low-order polynomials, while long-range interactions are captured through a flexible spectral-domain parameterization. This hybrid design unifies short- and long-distance information flow within a principled wavelet framework. Experiments show that LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks, while remaining competitive on short-range datasets.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: New Perspectives in Advancing Graph Machine Learning
♻ ☆ References Indeed Matter? Reference-Free Preference Optimization for Conversational Query Reformulation
Conversational query reformulation (CQR) has become indispensable for improving retrieval in dialogue-based applications. However, existing approaches typically rely on reference passages for optimization, which are impractical to acquire in real-world scenarios. To address this limitation, we introduce a novel reference-free preference optimization framework DualReform that generates pseudo reference passages from commonly-encountered conversational datasets containing only queries and responses. DualReform attains this goal through two key innovations: (1) response-based inference, where responses serve as proxies to infer pseudo reference passages, and (2) response refinement via the dual-role of CQR, where a CQR model refines responses based on the shared objectives between response refinement and CQR. Despite not relying on reference passages, DualReform achieves 96.9--99.1% of the retrieval accuracy attainable only with reference passages and surpasses the state-of-the-art method by up to 31.6%.
♻ ☆ AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits
Analog/Mixed-Signal (AMS) circuits play a critical role in the integrated circuit (IC) industry. However, automating Analog/Mixed-Signal (AMS) circuit design has remained a longstanding challenge due to its difficulty and complexity. Although recent advances in Multi-modal Large Language Models (MLLMs) offer promising potential for supporting AMS circuit analysis and design, current research typically evaluates MLLMs on isolated tasks within the domain, lacking a comprehensive benchmark that systematically assesses model capabilities across diverse AMS-related challenges. To address this gap, we introduce AMSbench, a benchmark suite designed to evaluate MLLM performance across critical tasks including circuit schematic perception, circuit analysis, and circuit design. AMSbench comprises approximately 8000 test questions spanning multiple difficulty levels and assesses eight prominent models, encompassing both open-source and proprietary solutions such as Qwen 2.5-VL and Gemini 2.5 Pro. Our evaluation highlights significant limitations in current MLLMs, particularly in complex multi-modal reasoning and sophisticated circuit design tasks. These results underscore the necessity of advancing MLLMs' understanding and effective application of circuit-specific knowledge, thereby narrowing the existing performance gap relative to human expertise and moving toward fully automated AMS circuit design workflows. Our data is released at this URL.
♻ ☆ FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel
Recent advance in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt much smaller number of query heads in each GQA group -- such an inconsistency significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation across a wide range of popular LLMs with varied smaller number of heads in each GQA group on modern GPUs. Compared to vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and 1.09x on average end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and 1.11x on average for prefill-phase speedup in LLM generative inference. Github Repo at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
♻ ☆ AB-UPT: Scaling Neural CFD Surrogates for High-Fidelity Automotive Aerodynamics Simulations via Anchored-Branched Universal Physics Transformers
Recent advances in neural surrogate modeling offer the potential for transformative innovations in applications such as automotive aerodynamics. Yet, industrial-scale problems often involve volumetric meshes with cell counts reaching 100 million, presenting major scalability challenges. Complex geometries further complicate modeling through intricate surface-volume interactions, while quantities such as vorticity are highly nonlinear and must satisfy strict divergence-free constraints. To address these requirements, we introduce AB-UPT as a novel modeling scheme for building neural surrogates for CFD simulations. AB-UPT is designed to: (i) decouple geometry encoding and prediction tasks via multi-branch operators; (ii) enable scalability to high-resolution outputs via neural simulation in a low-dimensional latent space, coupled with anchored neural field decoders to predict high-fidelity outputs; (iii) enforce physics consistency by a divergence-free formulation. We show that AB-UPT yields state-of-the-art predictive accuracy of surface and volume fields on automotive CFD simulations ranging from 33 thousand up to 150 million mesh cells. Furthermore, our anchored neural field architecture enables the enforcement of hard physical constraints on the physics predictions without degradation in performance, exemplified by modeling divergence-free vorticity fields. Notably, the proposed models can be trained on a single GPU in less than a day and predict industry-standard surface and volume fields within seconds. Additionally, we show that the flexible design of our method enables neural simulation from a CAD geometry alone, thereby eliminating the need for costly CFD meshing procedures for inference.
comment: Published in Transactions on Machine Learning Research
♻ ☆ On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions
Layer normalization (LN) is an essential component of modern neural networks. While many alternative techniques have been proposed, none of them have succeeded in replacing LN so far. The latest suggestion in this line of research is a dynamic activation function called Dynamic Tanh (DyT). Although it is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we shed light on the mathematical relationship between LN and dynamic activation functions. In particular, we derive DyT from the LN variant RMSNorm, and show that a well-defined decoupling in derivative space as well as an approximation are needed to do so. By applying the same decoupling procedure directly in function space, we are able to omit the approximation and obtain the exact element-wise counterpart of RMSNorm, which we call Dynamic Inverse Square Root Unit (DyISRU). We demonstrate numerically that DyISRU reproduces the normalization effect on outliers more accurately than DyT does.
comment: Revision and Simplification (starting point RMSNorm instead of LN)
♻ ☆ Early-Warning of Thunderstorm-Driven Power Outages with a Two-Stage Machine Learning Model
Thunderstorm-driven outages are difficult to predict because most storms do not cause damage, convective processes occur rapidly and chaotically, and the available public data are both noisy and incomplete. We develop a 24-48 h early-warning model for summer, thunderstorm-related outages in Michigan using only open sources (EAGLE-I for ground truth; METAR for weather). We use the publicly released EAGLE-I outage dataset (2014-2022), maintained by Oak Ridge National Laboratory for the U.S. Department of Energy. The pipeline preserves convective micro-signals from a sparse station network via parameter-specific kriging with hourly variograms and targeted overdrafting to retain extremes, and builds causal spatio-temporal features (lags/rolling statistics; k-NN/IDW spatial aggregates) capturing precursors of severe convection (moisture advection, wind shifts, and pressure drops). The two-stage model design, combining a logistic gate and an LSTM regressor, limits routine periods and reduces noise exposure. The study uses event-centric metrics (cluster-based hits/misses/false alarms) and peak-conditional MASE (cMASE) in +/-Delta-hour windows around state-level peaks (>= 50,000), with uncertainty quantified by hourly moving-block bootstrap. On the test sample, Two-Stage detects more reference peaks across all windows (e.g., at +/-48 h it records 3/4 vs. 2/4; F1 66.7% vs. 57.1%) with one extra false alarm. Near peaks, it shows modest amplitude gains (2-3% lower cMASE at +/-0-12 h; bootstrap medians +9-13% at +/-6-12 h) but small losses at +/-36-48 h (~3-4%). Overall, errors are comparable to the one-step LSTM baseline. SHAP analysis confirms moisture-advection and wind/gust precursors, underscoring the value of the feature engineering. Despite open-data noise, the feature-driven pipeline yields actionable, event-focused early warnings for thunderstorm outages.
comment: 23 pages (main), 70 pages incl. appendices; figures & tables as in manuscript. Code (main figure, synthetic data): https://github.com/IrynaStanishevska/peak-outage-forecasting- License: CC BY 4.0 (preprint)
♻ ☆ Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models NeurIPS 2025
Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.
comment: Oral at CCFM @ NeurIPS 2025
♻ ☆ VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
Aligning Vision-Language Models (VLMs) with safety standards is essential to mitigate risks arising from their multimodal complexity, where integrating vision and language unveils subtle threats beyond the reach of conventional safeguards. Inspired by the insight that reasoning across modalities is key to preempting intricate vulnerabilities, we propose a novel direction for VLM safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter, dynamically interpreting text-image interactions to deliver refined prompts that bolster safety across diverse VLM architectures without altering their core parameters. To achieve this, we devise a three-stage reasoning pipeline to synthesize a dataset that trains the rewriter to infer subtle threats, enabling tailored, actionable responses over generic refusals. Extensive experiments across three benchmarks with five VLMs reveal that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1 achieves a remarkable 43.59\% increase in average safety across five models on the SIUO benchmark.
♻ ☆ A Vision-Language Pre-training Model-Guided Approach for Mitigating Backdoor Attacks in Federated Learning
Defending backdoor attacks in Federated Learning (FL) under heterogeneous client data distributions encounters limitations balancing effectiveness and privacy-preserving, while most existing methods highly rely on the assumption of homogeneous client data distributions or the availability of a clean serve dataset. In this paper, we propose an FL backdoor defense framework, named CLIP-Fed, that utilizes the zero-shot learning capabilities of vision-language pre-training models. Our scheme overcomes the limitations of Non-IID imposed on defense effectiveness by integrating pre-aggregation and post-aggregation defense strategies. CLIP-Fed aligns the knowledge of the global model and CLIP on the augmented dataset using prototype contrastive loss and Kullback-Leibler divergence, so that class prototype deviations caused by backdoor samples are ensured and the correlation between trigger patterns and target labels is eliminated. In order to balance privacy-preserving and coverage enhancement of the dataset against diverse triggers, we further construct and augment the server dataset via using the multimodal large language model and frequency analysis without any client samples. Extensive experiments on representative datasets evidence the effectiveness of CLIP-Fed. Comparing to other existing methods, CLIP-Fed achieves an average reduction in Attack Success Rate, {\em i.e.}, 2.03\% on CIFAR-10 and 1.35\% on CIFAR-10-LT, while improving average Main Task Accuracy by 7.92\% and 0.48\%, respectively. Our codes are available at https://anonymous.4open.science/r/CLIP-Fed.
♻ ☆ Contrastive Representation Regularization for Vision-Language-Action Models
Vision-Language-Action (VLA) models have shown its capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states, by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipeline. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models; it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen, through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
comment: 20 pages, 12 figures
♻ ☆ Structured Generative Modeling with the Thermodynamic Kolmogorov-Arnold Model
Learning an energy-based model (EBM) in the latent space of a top-down generative model offers a versatile framework for generation across multiple data modalities. However, it remains unclear how its interpretability can be used to guide model design, improve generative quality, and reduce training time. Moreover, the reliance on Langevin Monte Carlo (LMC) sampling presents challenges in efficiency and sampling multimodal latent distributions. In this work, we propose a novel adaptation of the Kolmogorov-Arnold representation theorem for generative modeling and introduce the Thermodynamic Kolmogorov-Arnold Model (T-KAM) to take advantage of structural and inductive biases. By constraining the prior to univariate relationships, T-KAM enables fast and exact inference via the inverse transform method. With the low dimensionality of the latent space and suitable inductive biases encoded, we demonstrate that importance sampling (IS) becomes a viable, unbiased, and highly efficient posterior sampler. For situations where IS fails, we investigate a novel strategy using population-based LMC, which decomposes posterior sampling into a sequence of annealed distributions to improve multimodal sampling. T-KAM elegantly balances common trade-offs in generative modeling, offering fast inference, interpretability, and stable training, while being naturally suited to upcoming Zettascale Computing Corp. hardware.
♻ ☆ TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning
Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO often rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address these limitations, we propose **TemplateRL**, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics, and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates-enabling continuous updates during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.
♻ ☆ Discrete-Time Diffusion-Like Models for Speech Synthesis
Diffusion models have attracted a lot of attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, the time is typically discretized, leading to the mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training/inference conditions. This paper explores some diffusion-like discrete-time processes and proposes some new variants. These include processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise and a mixture of blurring and Gaussian noises. The experimental results suggest that discrete-time processes offer comparable subjective and objective speech quality to their widely popular continuous counterpart, with more efficient and consistent training and inference schemas.
♻ ☆ Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping on the importance-sampling (IS) weight. We study RL methods with sequence-level IS and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the optimization direction. FSPO introduces a simple remedy: we clip the sequence log-IS ratio with a band that scales as $\sqrt{L}$. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms baselines across model sizes and evaluation datasets, with the largest gains on the Qwen3-8B-Base model.
♻ ☆ TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems NeurIPS 2025
Regression models are crucial in recommender systems. However, retransformation bias problem has been conspicuously neglected within the community. While many works in other fields have devised effective bias correction methods, all of them are post-hoc cures externally to the model, facing practical challenges when applied to real-world recommender systems. Hence, we propose a preemptive paradigm to eradicate the bias intrinsically from the models via minor model refinement. Specifically, a novel TranSUN method is proposed with a joint bias learning manner to offer theoretically guaranteed unbiasedness under empirical superior convergence. It is further generalized into a novel generic regression model family, termed Generalized TranSUN (GTS), which not only offers more theoretical insights but also serves as a generic framework for flexibly developing various bias-free models. Comprehensive experimental results demonstrate the superiority of our methods across data from various domains, which have been successfully deployed in two real-world industrial recommendation scenarios, i.e. product and short video recommendation scenarios in Guess What You Like business domain in the homepage of Taobao App (a leading e-commerce platform with DAU > 300M), to serve the major online traffic.
comment: 30 pages, 7 figures, NeurIPS 2025 Poster
♻ ☆ LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLM's autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.
♻ ☆ Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations
In this paper, we propose a continuous-time formulation for the AdaGrad, RMSProp, and Adam optimization algorithms by modeling them as first-order integro-differential equations. We perform numerical simulations of these equations, along with stability and convergence analyses, to demonstrate their validity as accurate approximations of the original algorithms. Our results indicate a strong agreement between the behavior of the continuous-time models and the discrete implementations, thus providing a new perspective on the theoretical understanding of adaptive optimization methods.
comment: 47 pages, 15 figures
♻ ☆ From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks
The proliferation of UAVs has enabled a wide range of mission-critical applications and is becoming a cornerstone of low-altitude networks, supporting smart cities, emergency response, and more. However, the open wireless environment, dynamic topology, and resource constraints of UAVs expose low-altitude networks to severe DoS threats. Traditional defense approaches, which rely on fixed configurations or centralized decision-making, cannot effectively respond to the rapidly changing conditions in UAV swarm environments. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive DoS mitigation in low-altitude networks. Specifically, we design lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process, capturing the uncertain nature of UAV swarms under attack. Each UAV is equipped with a policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based algorithm, UAVs collaboratively optimize their policies via reward-weighted aggregation. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, under various DoS attack strategies. These results highlight the potential of intelligent, distributed defense mechanisms to protect low-altitude networks, paving the way for reliable and scalable low-altitude economy.
comment: 15pages; Minor Revision for IEEE TCCN
♻ ☆ FedIA: A Plug-and-Play Importance-Aware Gradient Pruning Aggregation Method for Domain-Robust Federated Graph Learning on Node Classification
Federated Graph Learning (FGL) under domain skew -- as observed on platforms such as \emph{Twitch Gamers} and multilingual \emph{Wikipedia} networks -- drives client models toward incompatible representations, rendering naive aggregation both unstable and ineffective. We find that the culprit is not the weighting scheme but the \emph{noisy gradient signal}: empirical analysis of baseline methods suggests that a vast majority of gradient dimensions can be dominated by domain-specific variance. We therefore shift focus from "aggregation-first" to a \emph{projection-first} strategy that denoises client updates \emph{before} they are combined. The proposed FedIA framework realises this \underline{I}mportance-\underline{A}ware idea through a two-stage, plug-and-play pipeline: (i) a server-side top-$\rho$ mask keeps only the most informative about 5% of coordinates, and (ii) a lightweight influence-regularised momentum weight suppresses outlier clients. FedIA adds \emph{no extra uplink traffic and only negligible server memory}, making it readily deployable. On both homogeneous (Twitch Gamers) and heterogeneous (Wikipedia) graphs, it yields smoother, more stable convergence and higher final accuracy than nine strong baselines. A convergence sketch further shows that dynamic projection maintains the optimal $\mathcal{O}(\sigma^{2}/\sqrt{T})$ rate.
♻ ☆ Discovering and Reasoning of Causality in the Hidden World with Large Language Models NeurIPS'24
Revealing hidden causal variables alongside the underlying causal mechanisms is essential to the development of science. Despite the progress in the past decades, existing practice in causal discovery (CD) heavily relies on high-quality measured variables, which are usually given by human experts. In fact, the lack of well-defined high-level variables behind unstructured data has been a longstanding roadblock to a broader real-world application of CD. This procedure can naturally benefit from an automated process that can suggest potential hidden variables in the system. Interestingly, Large language models (LLMs) are trained on massive observations of the world and have demonstrated great capability in processing unstructured data. To leverage the power of LLMs, we develop a new framework termed Causal representatiOn AssistanT (COAT) that incorporates the rich world knowledge of LLMs to propose useful measured variables for CD with respect to high-value target variables on their paired unstructured data. Instead of directly inferring causality with LLMs, COAT constructs feedback from intermediate CD results to LLMs to refine the proposed variables. Given the target variable and the paired unstructured data, we first develop COAT-MB that leverages the predictivity of the proposed variables to iteratively uncover the Markov Blanket of the target variable. Built upon COAT-MB, COAT-PAG further extends to uncover a more complete causal graph, i.e., Partial Ancestral Graph, by iterating over the target variables and actively seeking new high-level variables. Moreover, the reliable CD capabilities of COAT also extend the debiased causal inference to unstructured data by discovering an adjustment set. We establish theoretical guarantees for the CD results and verify their efficiency and reliability across realistic benchmarks and real-world case studies.
comment: Extended version of our previous NeurIPS'24 conference paper (arXiv:2402.03941(2)); Chenxi and Yongqiang contributed equally; 78 pages, 44 figures; Project page: https://causalcoat.github.io/discovering-and-reasoning
♻ ☆ Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers
Machine Learning models in finance are highly susceptible to model drift, where predictive performance declines as data distributions shift. This issue is especially acute in developing economies such as those in Central Asia and the Caucasus - including Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan - where frequent and unpredictable macroeconomics shocks destabilize financial data. To the best of our knowledge, this is among the first studies to examine drift mitigation methods on financial datasets from these regions. We investigate the use of synthetic outliers, a largely unexplored approach, to improve model stability against unforeseen shocks. To evaluate effectiveness, we introduce a two-level framework that measures both the extent of performance degradation and the severity of shocks. Our experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model
♻ ☆ How Effective Are Time-Series Models for Rainfall Nowcasting? A Comprehensive Benchmark for Rainfall Nowcasting Incorporating PWV Data
Rainfall nowcasting, which aims to predict precipitation within the next 0 to 3 hours, is critical for disaster mitigation and real-time response planning. However, most time series forecasting benchmarks in meteorology are evaluated on variables with strong periodicity, such as temperature and humidity, which fail to reflect model capabilities in more complex and practically meteorology scenarios like rainfall nowcasting. To address this gap, we propose RainfallBench, a benchmark designed for rainfall nowcasting, a highly challenging and practically relevant task characterized by zero inflation, temporal decay, and non-stationarity, focused on predicting precipitation within the next 0 to 3 hours. The dataset is derived from five years of meteorological observations, recorded at 15-minute intervals across six essential variables, and collected from more than 12,000 GNSS stations globally. In particular, it incorporates precipitable water vapor (PWV), a crucial indicator of rainfall that is absent in other datasets. We further design specialized evaluation strategies to assess model performance on key meteorological challenges, such as multi-scale prediction and extreme rainfall events, and evaluate over 20 state-of-the-art models across six major architectures on RainfallBench. Additionally, to address the zero-inflation and temporal decay issues overlooked by existing models, we introduce Bi-Focus Precipitation Forecaster (BFPF), a plug-and-play module that incorporates domain-specific priors to enhance rainfall time series forecasting. Statistical analysis and ablation studies validate the comprehensiveness of our dataset as well as the superiority of our methodology. Code and datasets are available at https://anonymous.4open.science/r/RainfallBench-A710.
comment: 11 pages,8 figures
♻ ☆ Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism
Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose \textbf{ParaStep}, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to \textbf{3.88}$\times$ on SVD, \textbf{2.43}$\times$ on CogVideoX-2b, and \textbf{6.56}$\times$ on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.
♻ ☆ Revisiting Node Affinity Prediction in Temporal Graphs
Node affinity prediction is a common task that is widely used in temporal graph learning with applications in social and financial networks, recommender systems, and more. Recent works have addressed this task by adapting state-of-the-art dynamic link property prediction models to node affinity prediction. However, simple heuristics, such as Persistent Forecast or Moving Average, outperform these models. In this work, we analyze the challenges in training current Temporal Graph Neural Networks for node affinity prediction and suggest appropriate solutions. Combining the solutions, we develop NAViS - Node Affinity prediction model using Virtual State, by exploiting the equivalence between heuristics and state space models. While promising, training NAViS is non-trivial. Therefore, we further introduce a novel loss function for node affinity prediction. We evaluate NAViS on TGB and show that it outperforms the state-of-the-art, including heuristics. Our source code is available at https://github.com/orfeld415/NAVIS
comment: preprint
♻ ☆ Training and Evaluating with Human Label Variation: An Empirical Study
Human label variation (HLV) challenges the standard assumption that a labelled instance has a single ground truth, instead embracing the natural variation in human annotation to train and evaluate models. While various training methods and metrics for HLV have been proposed, it is still unclear which methods and metrics perform best in what settings. We propose new evaluation metrics for HLV leveraging fuzzy set theory. Since these new proposed metrics are differentiable, we then in turn experiment with employing these metrics as training objectives. We conduct an extensive study over 6 HLV datasets testing 14 training methods and 6 evaluation metrics. We find that training on either disaggregated annotations or soft labels performs best across metrics, outperforming training using the proposed training objectives with differentiable metrics. We also show that our proposed soft micro F1 score is one of the best metrics for HLV data.
comment: 27 pages, 7 figures. Accepted to CL. Pre-MIT Press publication version. Fixed PO-JSD values on the MFRC dataset. Completely redid the empirical meta-evaluation, added more related work, and other minor edits
♻ ☆ RL in the Wild: Characterizing RLVR Training in LLM Deployment
Large Language Models (LLMs) are now widely used across many domains. With their rapid development, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and there is limited understanding of RLVR from a system perspective. To thoroughly understand the system challenges introduced by RLVR, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate the distribution and variation trends of workloads across different RL tasks across training steps. We identify issues such as GPU idling caused by skewed sequence length distribution, inefficient parallel strategies in dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose PolyTrace benchmark suite to conduct evaluation with realistic workloads, and a practical use case validates that PolyTrace benchmark suite exhibits 94.7% accuracy.
comment: 20 pages, 28 figures
♻ ☆ Joint Source-Environment Adaptation of Data-Driven Underwater Acoustic Source Ranging Based on Model Uncertainty
Adapting pre-trained deep learning models to new and unknown environments remains a major challenge in underwater acoustic localization. We show that although the performance of pre-trained models suffers from mismatch between the training and test data, they generally exhibit a higher uncertainty in environments where there is more mismatch. Additionally, in the presence of environmental mismatch, spurious peaks can appear in the output of classification-based localization approaches, which inspires us to define and use a method to quantify the "implied uncertainty" based on the number of model output peaks. Leveraging this notion of implied uncertainty, we partition the test samples into sets with more certain and less certain samples, and implement a method to adapt the model to new environments by using the certain samples to improve the labeling for uncertain samples, which helps to adapt the model. Thus, using this efficient method for model uncertainty quantification, we showcase an innovative approach to adapt a pre-trained model to unseen underwater environments at test time. This eliminates the need for labeled data from the target environment or the original training data. This adaptation is enhanced by integrating an independent estimate based on the received signal energy. We validate the approach extensively using real experimental data, as well as synthetic data consisting of model-generated signals with real ocean noise. The results demonstrate significant improvements in model prediction accuracy, underscoring the potential of the method to enhance underwater acoustic localization in diverse, noisy, and unknown environments.
♻ ☆ Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Large language models~(LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce \textit{RefAlign}, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models. The code is available at https://github.com/mzhaoshuai/RefAlign.
comment: The code is at https://github.com/mzhaoshuai/RefAlign
♻ ☆ Goal-Based Vision-Language Driving
Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion each contribute the most to these gains. Beyond safety, NovaDrive's shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
comment: 6 pages
♻ ☆ Learning Diffusion Models with Flexible Representation Guidance NeurIPS 2025
Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.
comment: NeurIPS 2025; Also Oral at ICML 2025 FM4LS workshop
♻ ☆ Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation NeurIPS 2025
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours. Our code is available at https://github.com/ruizheliUOA/ARC_JSD.
comment: Best Paper Award at COLM 2025 XLLM-Reason-Plan Workshop; Accepted at NeurIPS 2025 Mechanistic Interpretability Workshop
♻ ☆ Vision-Language Cross-Attention for Real-Time Autonomous Driving
Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m $\times$ 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15%. and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
comment: 5 pages
♻ ☆ QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while even exceeding the performance of 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in EDA and LLM communities.
♻ ☆ TabAttackBench: A Benchmark for Adversarial Attacks on Tabular Data
Adversarial attacks pose a significant threat to machine learning models by inducing incorrect predictions through imperceptible perturbations to input data. While these attacks are well studied in unstructured domains such as images, their behaviour on tabular data remains underexplored due to mixed feature types and complex inter-feature dependencies. This study introduces a comprehensive benchmark that evaluates adversarial attacks on tabular datasets with respect to both effectiveness and imperceptibility. We assess five white-box attack algorithms (FGSM, BIM, PGD, DeepFool, and C\&W) across four representative models (LR, MLP, TabTransformer and FT-Transformer) using eleven datasets spanning finance, energy, and healthcare domains. The benchmark employs four quantitative imperceptibility metrics (proximity, sparsity, deviation, and sensitivity) to characterise perturbation realism. The analysis quantifies the trade-off between these two aspects and reveals consistent differences between attack types, with $\ell_\infty$-based attacks achieving higher success but lower subtlety, and $\ell_2$-based attacks offering more realistic perturbations. The benchmark findings offer actionable insights for designing more imperceptible adversarial attacks, advancing the understanding of adversarial vulnerability in tabular machine learning.
comment: 71 pages, 21 figures, 11 tables
♻ ☆ Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts
Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.
♻ ☆ FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning
Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy's behavior to stabilize policy learning. Due to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.
comment: 20 pages
♻ ☆ Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo
We study the problem of posterior sampling in the context of score based generative models. We have a trained score network for a prior $p(x)$, a measurement model $p(y|x)$, and are tasked with sampling from the posterior $p(x|y)$. Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general "tilting" problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time.
♻ ☆ Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN
Graph Edit Distance (GED) is a fundamental graph similarity metric widely used in various applications. However, computing GED is an NP-hard problem. Recent state-of-the-art hybrid GED solver has shown promising performance by formulating GED as a bipartite graph matching problem, then leveraging a generative diffusion model to predict node matching between two graphs, from which both the GED and its corresponding edit path can be extracted using a traditional algorithm. However, such methods typically rely heavily on ground-truth supervision, where the ground-truth node matchings are often costly to obtain in real-world scenarios. In this paper, we propose GEDRanker, a novel unsupervised GAN-based framework for GED computation. Specifically, GEDRanker consists of a matching-based GED solver and introduces an interpretable preference-aware discriminator. By leveraging preference signals over different node matchings derived from edit path lengths, the discriminator can guide the matching-based solver toward generating high-quality node matching without the need for ground-truth supervision. Extensive experiments on benchmark datasets demonstrate that our GEDRanker enables the matching-based GED solver to achieve near-optimal solution quality without any ground-truth supervision.
♻ ☆ Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning
Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
♻ ☆ Adaptive UAV-Assisted Hierarchical Federated Learning: Optimizing Energy, Latency, and Resilience for Dynamic Smart IoT
Hierarchical Federated Learning (HFL) extends conventional Federated Learning (FL) by introducing intermediate aggregation layers, enabling distributed learning in geographically dispersed environments, particularly relevant for smart IoT systems, such as remote monitoring and battlefield operations, where cellular connectivity is limited. In these scenarios, UAVs serve as mobile aggregators, dynamically connecting terrestrial IoT devices. This paper investigates an HFL architecture with energy-constrained, dynamically deployed UAVs prone to communication disruptions. We propose a novel approach to minimize global training costs by formulating a joint optimization problem that integrates learning configuration, bandwidth allocation, and device-to-UAV association, ensuring timely global aggregation before UAV disconnections and redeployments. The problem accounts for dynamic IoT devices and intermittent UAV connectivity and is NP-hard. To tackle this, we decompose it into three subproblems: \textit{(i)} optimizing learning configuration and bandwidth allocation via an augmented Lagrangian to reduce training costs; \textit{(ii)} introducing a device fitness score based on data heterogeneity (via Kullback-Leibler divergence), device-to-UAV proximity, and computational resources, using a TD3-based algorithm for adaptive device-to-UAV assignment; \textit{(iii)} developing a low-complexity two-stage greedy strategy for UAV redeployment and global aggregator selection, ensuring efficient aggregation despite UAV disconnections. Experiments on diverse real-world datasets validate the approach, demonstrating cost reduction and robust performance under communication disruptions.
comment: Accepted by IEEE Transactions on Services Computing(22 pages, 11 figures)
♻ ☆ One-Stage Top-$k$ Learning-to-Defer: Score-Based Surrogates with Theoretical Guarantees
We introduce the first one-stage Top-$k$ Learning-to-Defer framework, which unifies prediction and deferral by learning a shared score-based model that selects the $k$ most cost-effective entities-labels or experts-per input. While existing one-stage L2D methods are limited to deferring to a single expert, our approach jointly optimizes prediction and deferral across multiple entities through a single end-to-end objective. We define a cost-sensitive loss and derive a novel convex surrogate that is independent of the cardinality parameter $k$, enabling generalization across Top-$k$ regimes without retraining. Our formulation recovers the Top-1 deferral policy of prior score-based methods as a special case, and we prove that our surrogate is both Bayes-consistent and $\mathcal{H}$-consistent under mild assumptions. We further introduce an adaptive variant, Top-$k(x)$, which dynamically selects the number of consulted entities per input to balance predictive accuracy and consultation cost. Experiments on CIFAR-10 and SVHN confirm that our one-stage Top-$k$ method strictly outperforms Top-1 deferral, while Top-$k(x)$ achieves superior accuracy-cost trade-offs by tailoring allocations to input complexity.
comment: Merged with another paper: 'Why Ask One When You Can Ask k? Learning-to-Defer to the Top-k Experts" (arXiv:2504.12988)
♻ ☆ Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning
Forecasting future links is a central task in temporal graph (TG) reasoning, requiring models to leverage historical interactions to predict upcoming ones. Traditional neural approaches, such as temporal graph neural networks, achieve strong performance but lack explainability and cannot be applied to unseen graphs without retraining. Recent studies have begun to explore using large language models (LLMs) for graph reasoning, but most of them are constrained to static graphs or small synthetic TGs and lack the evaluation of the quality of reasoning traces generated by LLMs. In this work, we present Reasoning-Enhanced Learning for Temporal Graphs (ReaL-TG), a reinforcement learning framework that fine-tunes LLMs to perform explainable link forecasting on real-world TGs. ReaL-TG uses outcome-based reward to encourage models to self-explore reasoning strategies from graph structure and to produce explanations that directly justify their predictions. To enable evaluation on LLM-generated reasoning traces, we propose a new evaluation protocol combining ranking metrics with an LLM-as-a-Judge system that assesses both the quality of reasoning and the impact of hallucinations. Experiments with ReaL-TG-4B, obtained by fine-tuning Qwen3-4B under our framework, show that it outperforms much larger frontier LLMs, including GPT-5 mini, on ranking metrics, while producing high-quality explanations confirmed by both the LLM judge and human evaluation.
♻ ☆ Can LLMs Handle WebShell Detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework
WebShell attacks, where malicious scripts are injected into web servers, pose a significant cybersecurity threat. Traditional ML and DL methods are often hampered by challenges such as the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models have emerged as powerful alternatives for code-related tasks, but their potential in WebShell detection remains underexplored. In this paper, we make two contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods using a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling that enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that, stemming from their distinct analytical strategies, larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off. However, all baseline models lag behind previous SOTA methods. With the application of BFAD, the performance of all LLMs improves significantly, yielding an average F1 score increase of 13.82%. Notably, larger models now outperform SOTA benchmarks, while smaller models such as Qwen-2.5-Coder-3B achieve performance competitive with traditional methods. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection and provides solutions to address the challenges in this task.
comment: Published as a conference paper at COLM 2025
♻ ☆ Steering LLMs for Formal Theorem Proving
Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for models to interpret them precisely. While existing methods achieve strong performance, little is known about how LLMs internally represent informal cues, or how these influence proof generation. To address this, we explore \textit{activation steering}, an inference-time intervention that identifies linear directions in residual activations associated with informal reasoning traces and adjusts them to improve proof construction without fine-tuning. This mechanism also yields interpretable information about how reasoning is internally encoded in the activation space of LLMs. We test our method for generating formal proofs from already-formalized theorems. Our contributions are twofold: (1) a novel activation-based intervention for guiding proof synthesis in LLMs; and (2) demonstration that this intervention improves performance under two decoding strategies (sampling and best-first search) without any further training.
♻ ☆ Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM NeurIPS 2025
Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, the HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up the training by 2.4$\times$ with competitive performance. Codes will be released at https://github.com/ModelTC/HBP.
comment: Accepted in NeurIPS 2025
♻ ☆ FRIREN: Beyond Trajectories -- A Spectral Lens on Time NeurIPS 2025
Long-term time-series forecasting (LTSF) models are often presented as general-purpose solutions that can be applied across domains, implicitly assuming that all data is pointwise predictable. Using chaotic systems such as Lorenz-63 as a case study, we argue that geometric structure - not pointwise prediction - is the right abstraction for a dynamic-agnostic foundational model. Minimizing the Wasserstein-2 distance (W2), which captures geometric changes, and providing a spectral view of dynamics are essential for long-horizon forecasting. Our model, FRIREN (Flow-inspired Representations via Interpretable Eigen-networks), implements an augmented normalizing-flow block that embeds data into a normally distributed latent representation. It then generates a W2-efficient optimal path that can be decomposed into rotation, scaling, inverse rotation, and translation. This architecture yields locally generated, geometry-preserving predictions that are independent of the underlying dynamics, and a global spectral representation that functions as a finite Koopman operator with a small modification. This enables practitioners to identify which modes grow, decay, or oscillate, both locally and system-wide. FRIREN achieves an MSE of 11.4, MAE of 1.6, and SWD of 0.96 on Lorenz-63 in a 336-in, 336-out, dt=0.01 setting, surpassing TimeMixer (MSE 27.3, MAE 2.8, SWD 2.1). The model maintains effective prediction for 274 out of 336 steps, approximately 2.5 Lyapunov times. On Rossler (96-in, 336-out), FRIREN achieves an MSE of 0.0349, MAE of 0.0953, and SWD of 0.0170, outperforming TimeMixer's MSE of 4.3988, MAE of 0.886, and SWD of 3.2065. FRIREN is also competitive on standard LTSF datasets such as ETT and Weather. By connecting modern generative flows with classical spectral analysis, FRIREN makes long-term forecasting both accurate and interpretable, setting a new benchmark for LTSF model design.
comment: 37 pages, 4 figures. Submitted to NeurIPS 2025. Public code at https://anonymous.4open.science/r/LTSF_model-03BB/
♻ ☆ Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models
Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to a substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature, where it can be integrated into existing LLM architecture based on a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.
♻ ☆ Online Auction Design Using Distribution-Free Uncertainty Quantification with Applications to E-Commerce
Online auction is a cornerstone of e-commerce, and a key challenge is designing incentive-compatible mechanisms that maximize expected revenue. Existing approaches often assume known bidder value distributions and fixed sets of bidders and items, but these assumptions rarely hold in real-world settings where bidder values are unknown, and the number of future participants is uncertain. In this paper, we introduce the Conformal Online Auction Design (COAD), a novel mechanism that maximizes revenue by quantifying uncertainty in bidder values without relying on known distributions. COAD incorporates both bidder and item features, using historical data to design an incentive-compatible mechanism for online auctions. Unlike traditional methods, COAD leverages distribution-free uncertainty quantification techniques and integrates machine learning methods, such as random forests, kernel methods, and deep neural networks, to predict bidder values while ensuring revenue guarantees. Moreover, COAD introduces bidder-specific reserve prices, based on the lower confidence bounds of bidder valuations, contrasting with the single reserve prices commonly used in the literature. We demonstrate the practical effectiveness of COAD through an application to real-world eBay auction data. Theoretical results and extensive simulation studies further validate the properties of our approach.
♻ ☆ Generating Multi-Image Synthetic Data for Text-to-Image Customization ICCV 2025
Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD, and with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
comment: ICCV 2025. Project webpage: https://www.cs.cmu.edu/~syncd-project/
♻ ☆ A Comprehensive Evaluation on Quantization Techniques for Large Language Models
For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often evaluated under different settings because a method typically contains multiple components. Analyzing connections among existing methods is important for deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. To our knowledge, such a fair and extensive investigation remains critically underexplored. To better understand connections, first, we decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces outlier impact by flattening the data distribution; the latter offsets quantization errors to improve performance. Second, we evaluate and analyze the impact of different settings, including granularity and symmetry. Third, we analyze and evaluate the latest MXFP4 and NVFP4 data formats and their performance. Our experiments first demonstrate that optimized rotation and scaling yield the best pre-quantization performance, and that combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone for error mitigation. Second, finer granularity improves performance but increases storage overhead. Third, we find that scaling-factor format and precision greatly affect FP4 performance, and that rotation-based strategies effective for INT4 offer limited gains for MXFP4 and NVFP4, motivating further study.
♻ ☆ TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
Diffusion Transformer (DiT) models excel at generating highquality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging due to their high computational cost, particularly at large resolutions. Existing serving systems use fixed degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism to dynamically adjust the parallel degree of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment: (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level and minimize GPU hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality.
♻ ☆ Precise Mobile Manipulation of Small Everyday Objects
Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM uses state-of-the-art vision foundation models to generate 3D targets for visual servoing to enable diverse tasks in novel environments. Naively doing so fails because of occlusion by the end-effector. SVM mitigates this using vision models that out-paint the end-effector, thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module for SVM to seek semantic targets (e.g. knobs) and point tracking methods can help SVM reliably pursue interaction sites indicated by user clicks. We conduct a large-scale evaluation spanning experiments in 10 novel environments across 6 buildings including 72 different object instances. SVM obtains a 71% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method by an absolute 42% and an imitation learning baseline trained on 1000+ demonstrations also by an absolute success rate of 50%.
comment: Project webpage: https://arjung128.github.io/svm
♻ ☆ Posterior Sampling for Continuing Environments
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
comment: RLC 2024
♻ ☆ SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space
The rapid growth of digital pathology and advances in self-supervised deep learning have enabled the development of foundational models for various pathology tasks across diverse diseases. While multimodal approaches integrating diverse data sources have emerged, a critical gap remains in the comprehensive integration of whole-slide images (WSIs) with spatial transcriptomics (ST), which is crucial for capturing critical molecular heterogeneity beyond standard hematoxylin & eosin (H&E) staining. We introduce SPADE, a foundation model that integrates histopathology with ST data to guide image representation learning within a unified framework, in effect creating an ST-informed latent space. SPADE leverages a mixture-of-data experts technique, where experts are created via two-stage imaging feature-space clustering using contrastive learning to learn representations of co-registered WSI patches and gene expression profiles. Pre-trained on the comprehensive HEST-1k dataset, SPADE is evaluated on 20 downstream tasks, demonstrating significantly superior few-shot performance compared to baseline models, highlighting the benefits of integrating morphological and molecular information into one latent space. Code and pretrained weights are available at https://github.com/uclabair/SPADE.
♻ ☆ Agent Learning via Early Experience
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
comment: Work in progress
♻ ☆ MergeBench: A Benchmark for Merging Domain-Specialized LLMs NeurIPS 2025
Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at \href{https://yifei-he.github.io/mergebench/}{https://yifei-he.github.io/mergebench/}.
comment: NeurIPS 2025 Track on Datasets and Benchmarks
♻ ☆ Scalable Policy-Based RL Algorithms for POMDPs NeurIPS 2025
The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP model into a finite-state Markov Decision Process (MDP) (called Superstate MDP). We first derive theoretical guarantees that improve upon prior work that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by Policy Optimization by treating it as an MDP, where the MDP state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
comment: 36 pages, 3 Figures, Accepted at NeurIPS 2025
♻ ☆ Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification
The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound to the number of variables which are necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically at a small scale, the ID is very large, as the data are affected by measurement errors. At large scale, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that for distances smaller than the correct scale the density of the data is constant. In the presented framework, to estimate the density it is necessary to know the ID, therefore, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets.
♻ ☆ K-ASTRO: Structure-Aware Adaptation of LLMs for Code Vulnerability Detection
Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection-a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, limiting their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both efficiency and accuracy in code vulnerability detection. Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline to unify code semantics and syntax. Experimental results on three large-scale datasets, including BigVul, DiverseVul, and PrimeVul-demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, providing open-sourced tools to foster further research.
♻ ☆ Neon: Negative Extrapolation From Self-Training Improves Image Generation
Scaling generative AI models is bottlenecked by the scarcity of high-quality training data. The ease of synthesizing from a generative model suggests using (unverified) synthetic data to augment a limited corpus of real data for the purpose of fine-tuning in the hope of improving performance. Unfortunately, however, the resulting positive feedback loop leads to model autophagy disorder (MAD, aka model collapse) that results in a rapid degradation in sample quality and/or diversity. In this paper, we introduce Neon (for Negative Extrapolation frOm self-traiNing), a new learning method that turns the degradation from self-training into a powerful signal for self-improvement. Given a base model, Neon first fine-tunes it on its own self-synthesized data but then, counterintuitively, reverses its gradient updates to extrapolate away from the degraded weights. We prove that Neon works because typical inference samplers that favor high-probability regions create a predictable anti-alignment between the synthetic and real data population gradients, which negative extrapolation corrects to better align the model with the true data distribution. Neon is remarkably easy to implement via a simple post-hoc merge that requires no new real data, works effectively with as few as 1k synthetic samples, and typically uses less than 1% additional training compute. We demonstrate Neon's universality across a range of architectures (diffusion, flow matching, autoregressive, and inductive moment matching models) and datasets (ImageNet, CIFAR-10, and FFHQ). In particular, on ImageNet 256x256, Neon elevates the xAR-L model to a new state-of-the-art FID of 1.02 with only 0.36% additional training compute. Code is available at https://github.com/VITA-Group/Neon
♻ ☆ Panda: A pretrained forecast model for chaotic dynamics
Chaotic systems are intrinsically sensitive to small errors, challenging efforts to construct predictive data-driven models of real-world dynamical systems such as fluid flows or neuronal activity. Prior efforts comprise either specialized models trained separately on individual time series, or foundation models trained on vast time series databases with little underlying dynamical structure. Motivated by dynamical systems theory, we present Panda, Patched Attention for Nonlinear Dynamics. We train Panda on a novel synthetic, extensible dataset of $2 \times 10^4$ chaotic dynamical systems that we discover using an evolutionary algorithm. Trained purely on simulated data, Panda exhibits emergent properties: zero-shot forecasting of unseen chaotic systems preserving both short-term accuracy and long-term statistics. Despite having been trained only on low-dimensional ordinary differential equations, Panda spontaneously develops the ability to predict partial differential equations without retraining. We also demonstrate a neural scaling law for differential equations, underscoring the potential of pretrained models for probing abstract mathematical domains like nonlinear dynamics.
♻ ☆ COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is not always sufficient to capture the full range and complexity of general human preferences. We explore RLHF under a general preference framework by modeling the alignment problem as a two-player zero-sum game in a game-theoretic framework, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous self-play algorithms for finding the Nash policy either diverge or only converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. We provide theoretical analysis that our meta-algorithm converges to an exact Nash policy in the last iterate and demonstrate its effectiveness on a range of synthetic and preference optimization datasets. COMAL is simple and can be integrated with many existing methods designed for preference optimization with minimal changes, and empirically it consistently maintains above 60.2% and 56.8% win rates, when applied to Llama-3-8B-Instruct and Qwen2.5-7B, against all compared algorithms under controlled evaluations.
♻ ☆ Evaluating multiple models using labeled and unlabeled data NeurIPS 2025
It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models.
comment: To appear at NeurIPS 2025
♻ ☆ Multi-View Majority Vote Learning Algorithms: Direct Minimization of PAC-Bayesian Bounds
The PAC-Bayesian framework has significantly advanced the understanding of statistical learning, particularly for majority voting methods. Despite its successes, its application to multi-view learning -- a setting with multiple complementary data representations -- remains underexplored. In this work, we extend PAC-Bayesian theory to multi-view learning, introducing novel generalization bounds based on R\'enyi divergence. These bounds provide an alternative to traditional Kullback-Leibler divergence-based counterparts, leveraging the flexibility of R\'enyi divergence. Furthermore, we propose first- and second-order oracle PAC-Bayesian bounds and extend the C-bound to multi-view settings. To bridge theory and practice, we design efficient self-bounding optimization algorithms that align with our theoretical results.
♻ ☆ Generative Deep Learning Framework for Inverse Design of Fuels
In the present work, a generative deep learning framework combining a Co-optimized Variational Autoencoder (Co-VAE) architecture with quantitative structure-property relationship (QSPR) techniques is developed to enable accelerated inverse design of fuels. The Co-VAE integrates a property prediction component coupled with the VAE latent space, enhancing molecular reconstruction and accurate estimation of Research Octane Number (RON) (chosen as the fuel property of interest). A subset of the GDB-13 database, enriched with a curated RON database, is used for model training. Hyperparameter tuning is further utilized to optimize the balance among reconstruction fidelity, chemical validity, and RON prediction. An independent regression model is then used to refine RON prediction, while a differential evolution algorithm is employed to efficiently navigate the VAE latent space and identify promising fuel molecule candidates with high RON. This methodology addresses the limitations of traditional fuel screening approaches by capturing complex structure-property relationships within a comprehensive latent representation. The generative model can be adapted to different target properties, enabling systematic exploration of large chemical spaces relevant to fuel design applications. Furthermore, the demonstrated framework can be readily extended by incorporating additional synthesizability criteria to improve applicability and reliability for de novo design of new fuels.
♻ ☆ A Statistical Theory of Contrastive Learning via Approximate Sufficient Statistics
Contrastive learning -- a modern approach to extract useful representations from unlabeled data by training models to distinguish similar samples from dissimilar ones -- has driven significant progress in foundation models. In this work, we develop a new theoretical framework for analyzing data augmentation-based contrastive learning, with a focus on SimCLR as a representative example. Our approach is based on the concept of \emph{approximate sufficient statistics}, which we extend beyond its original definition in \cite{oko2025statistical} for contrastive language-image pretraining (CLIP) using KL-divergence. We generalize it to equivalent forms and general f-divergences, and show that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient. Furthermore, we demonstrate that these near-sufficient encoders can be effectively adapted to downstream regression and classification tasks, with performance depending on their sufficiency and the error induced by data augmentation in contrastive learning. Concrete examples in linear regression and topic classification are provided to illustrate the broad applicability of our results.
♻ ☆ LLMBridge: Reducing Costs in a Prompt-Centric Internet
Today's Internet infrastructure is centered around content retrieval over HTTP, with middleboxes (e.g., HTTP proxies) playing a crucial role in performance, security, and cost-effectiveness. We envision a future where Internet communication will be dominated by "prompts" sent to generative AI models. For this, we will need proxies that provide similar functions to HTTP proxies (e.g., caching, routing, compression) while dealing with unique challenges and opportunities of prompt-based communication. As a first step toward supporting prompt-based communication, we present LLMBridge, an LLM proxy designed for cost-conscious users, such as those in developing regions and education (e.g., students, instructors). LLMBridge supports three key optimizations: model selection (routing prompts to the most suitable model), context management (intelligently reducing the amount of context), and semantic caching (serving prompts using local models and vector databases). These optimizations introduce trade-offs between cost and quality, which applications navigate through a high-level, bidirectional interface. As case studies, we deploy LLMBridge in two cost-sensitive settings: a WhatsApp-based Q&A service and a university classroom environment. The WhatsApp service has been live for over twelve months, serving 100+ users and handling more than 14.7K requests. In parallel, we exposed LLMBridge to students across three computer science courses over a semester, where it supported diverse LLM-powered applications - such as reasoning agents and chatbots - and handled an average of 500 requests per day. We report on deployment experiences across both settings and use the collected workloads to benchmark the effectiveness of various cost-optimization strategies, analyzing their trade-offs in cost, latency, and response quality.
♻ ☆ Newton-Puiseux Analysis for Interpretability and Calibration of Complex-Valued Neural Networks
Complex-valued neural networks (CVNNs) are particularly suitable for handling phase-sensitive signals, including electrocardiography (ECG), radar/sonar, and wireless in-phase/quadrature (I/Q) streams. Nevertheless, their \emph{interpretability} and \emph{probability calibration} remain insufficiently investigated. In this work, we present a Newton--Puiseux framework that examines the \emph{local decision geometry} of a trained CVNN by (i) fitting a small, kink-aware polynomial surrogate to the \emph{logit difference} in the vicinity of uncertain inputs, and (ii) factorizing this surrogate using Newton--Puiseux expansions to derive analytic branch descriptors, including exponents, multiplicities, and orientations. These descriptors provide phase-aligned directions that induce class flips in the original network and allow for a straightforward, \emph{multiplicity-guided} temperature adjustment for improved calibration. We outline assumptions and diagnostic measures under which the surrogate proves informative and characterize potential failure modes arising from piecewise-holomorphic activations (e.g., modReLU). Our phase-aware analysis identifies sensitive directions and enhances Expected Calibration Error in two case studies beyond a controlled $\C^2$ synthetic benchmark -- namely, the MIT--BIH arrhythmia (ECG) dataset and RadioML 2016.10a (wireless modulation) -- when compared to uncalibrated softmax and standard post-hoc baselines. We also present confidence intervals, non-parametric tests, and quantify sensitivity to inaccuracies in estimating branch multiplicity. Crucially, this method requires no modifications to the architecture and applies to any CVNN with complex logits transformed to real moduli.
comment: Journal pre-proof available. Please cite the journal version: Neural Networks (2025), doi:10.1016/j.neunet.2025.108172
♻ ☆ Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games
A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. Although a convergence rate of $T^{-1}$ has long been established since Nemirovski's mirror-prox algorithm and Nesterov's excessive gap technique in the early 2000s, the most effective paradigm in practice is *counterfactual regret minimization*, which is based on *regret matching* and its modern variants. In particular, the state of the art across most benchmarks is *predictive* regret matching$^+$ (PRM$^+$), in conjunction with non-uniform averaging. Yet, such algorithms can exhibit slower $\Omega(T^{-1/2})$ convergence even in self-play. In this paper, we close the gap between theory and practice. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ (i.e., optimal) average-iterate convergence guarantees, while also being on par with PRM$^+$ on benchmark games. From a technical standpoint, we draw an analogy between IREG-PRM$^+$ and optimistic gradient descent with *adaptive* learning rate. The basic flaw of PRM$^+$ is that the ($\ell_2$-)norm of the regret vector -- which can be thought of as the inverse of the learning rate -- can decrease. By contrast, we design IREG-PRM$^+$ so as to maintain the invariance that the norm of the regret vector is nondecreasing. This enables us to derive an RVU-type bound for IREG-PRM$^+$, the first such property that does not rely on introducing additional hyperparameters to enforce smoothness. Furthermore, we find that IREG-PRM$^+$ performs on par with an adaptive version of optimistic gradient descent that we introduce whose learning rate depends on the misprediction error, demystifying the effectiveness of the regret matching family *vis-a-vis* more standard optimization techniques.
♻ ☆ Pattern-based Knowledge Component Extraction from Student Code Using Representation Learning
Effective personalized learning in computer science education depends on accurately modeling what students know and what they need to learn. While Knowledge Components (KCs) provide a foundation for such modeling, automated KC extraction from student code is inherently challenging due to insufficient explainability of discovered KCs and the open-endedness of programming problems with significant structural variability across student solutions and complex interactions among programming concepts. In this work, we propose a novel, explainable framework for automated KC discovery through pattern-based KCs: recurring structural patterns within student code that capture the specific programming patterns and language constructs that students must master. Toward this, we train a Variational Autoencoder to generate important representative patterns from student code guided by an explainable, attention-based code representation model that identifies important correct and incorrect pattern implementations from student code. These patterns are then clustered to form pattern-based KCs. We evaluate our KCs using two well-established methods informed by Cognitive Science: learning curve analysis and Deep Knowledge Tracing (DKT). Experimental results demonstrate meaningful learning trajectories and significant improvements in DKT predictive performance over traditional KT methods. This work advances knowledge modeling in CS education by providing an automated, scalable, and explainable framework for identifying granular code patterns and algorithmic constructs, essential for student learning.
♻ ☆ AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws
Neural scaling laws are observed in a range of domains, to date with no universal understanding of why they occur. Recent theories suggest that loss power laws arise from Zipf's law, a power law observed in domains like natural language. One theory suggests that language scaling laws emerge when Zipf-distributed task quanta are learned in descending order of frequency. In this paper we examine power-law scaling in AlphaZero, a reinforcement learning algorithm, using a model of language-model scaling. We find that game states in training and inference data scale with Zipf's law, which is known to arise from the tree structure of the environment, and examine the correlation between scaling-law and Zipf's-law exponents. In agreement with the quanta scaling model, we find that agents optimize state loss in descending order of frequency, even though this order scales inversely with modelling complexity. We also find that inverse scaling, the failure of models to improve with size, is correlated with unusual Zipf curves where end-game states are among the most frequent states. We show evidence that larger models shift their focus to these less-important states, sacrificing their understanding of important early-game states.
♻ ☆ Physics-Based Machine Learning Closures and Wall Models for Hypersonic Transition-Continuum Boundary Layer Predictions
Modeling rarefied hypersonic flows remains a fundamental challenge due to the breakdown of classical continuum assumptions in the transition-continuum regime, where the Knudsen number ranges from approximately 0.1 to 10. Conventional Navier-Stokes-Fourier (NSF) models with empirical slip-wall boundary conditions fail to accurately predict nonequilibrium effects such as velocity slip, temperature jump, and shock structure deviations. We develop a physics-constrained machine learning framework that augments transport models and boundary conditions to extend the applicability of continuum solvers in nonequilibrium hypersonic regimes. We employ deep learning PDE models (DPMs) for the viscous stress and heat flux embedded in the governing PDEs and trained via adjoint-based optimization. We evaluate these for two-dimensional supersonic flat-plate flows across a range of Mach and Knudsen numbers. Additionally, we introduce a wall model based on a mixture of skewed Gaussian approximations of the particle velocity distribution function. This wall model replaces empirical slip conditions with physically informed, data-driven boundary conditions for the streamwise velocity and wall temperature. Our results show that a trace-free anisotropic viscosity model, paired with the skewed-Gaussian distribution function wall model, achieves significantly improved accuracy, particularly at high-Mach and high-Knudsen number regimes. Strategies such as parallel training across multiple Knudsen numbers and inclusion of high-Mach data during training are shown to enhance model generalization. Increasing model complexity yields diminishing returns for out-of-sample cases, underscoring the need to balance degrees of freedom and overfitting. This work establishes data-driven, physics-consistent strategies for improving hypersonic flow modeling for regimes in which conventional continuum approaches are invalid.
♻ ☆ OptiFLIDS: Optimized Federated Learning for Energy-Efficient Intrusion Detection in IoT
In critical IoT environments, such as smart homes and industrial systems, effective Intrusion Detection Systems (IDS) are essential for ensuring security. However, developing robust IDS solutions remains a significant challenge. Traditional machine learning-based IDS models typically require large datasets, but data sharing is often limited due to privacy and security concerns. Federated Learning (FL) presents a promising alternative by enabling collaborative model training without sharing raw data. Despite its advantages, FL still faces key challenges, such as data heterogeneity (non-IID data) and high energy and computation costs, particularly for resource constrained IoT devices. To address these issues, this paper proposes OptiFLIDS, a novel approach that applies pruning techniques during local training to reduce model complexity and energy consumption. It also incorporates a customized aggregation method to better handle pruned models that differ due to non-IID data distributions. Experiments conducted on three recent IoT IDS datasets, TON_IoT, X-IIoTID, and IDSIoT2024, demonstrate that OptiFLIDS maintains strong detection performance while improving energy efficiency, making it well-suited for deployment in real-world IoT environments.
comment: 12 pages, 15 figures
♻ ☆ ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization NeurIPS 2025
The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: For 3-bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for learning 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
comment: NeurIPS 2025. Model weights are available at https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95
♻ ☆ Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $\epsilon$ and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.
comment: 48 pages, 3 figures, 4 tables
♻ ☆ Leveraging Predictive Equivalence in Decision Trees ICML 2025
Decision trees are widely used for interpretable machine learning due to their clearly structured reasoning process. However, this structure belies a challenge we refer to as predictive equivalence: a given tree's decision boundary can be represented by many different decision trees. The presence of models with identical decision boundaries but different evaluation processes makes model selection challenging. The models will have different variable importance and behave differently in the presence of missing values, but most optimization procedures will arbitrarily choose one such model to return. We present a boolean logical representation of decision trees that does not exhibit predictive equivalence and is faithful to the underlying decision boundary. We apply our representation to several downstream machine learning tasks. Using our representation, we show that decision trees are surprisingly robust to test-time missingness of feature values; we address predictive equivalence's impact on quantifying variable importance; and we present an algorithm to optimize the cost of reaching predictions.
comment: Accepted to ICML 2025
♻ ☆ BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping
Benthic habitat mapping is fundamental for understanding marine ecosystems, guiding conservation efforts, and supporting sustainable resource management. Yet, the scarcity of large, annotated datasets limits the development and benchmarking of machine learning models in this domain. This paper introduces a thorough multi-modal dataset, comprising about a million side-scan sonar (SSS) tiles collected along the coast of Catalonia (Spain), complemented by bathymetric maps and a set of co-registered optical images from targeted surveys using an autonomous underwater vehicle (AUV). Approximately \num{36000} of the SSS tiles have been manually annotated with segmentation masks to enable supervised fine-tuning of classification models. All the raw sensor data, together with mosaics, are also released to support further exploration and algorithm development. To address challenges in multi-sensor data fusion for AUVs, we spatially associate optical images with corresponding SSS tiles, facilitating self-supervised, cross-modal representation learning. Accompanying open-source preprocessing and annotation tools are provided to enhance accessibility and encourage research. This resource aims to establish a standardized benchmark for underwater habitat mapping, promoting advancements in autonomous seafloor classification and multi-sensor integration.
comment: Article under review by IJRR
♻ ☆ Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models
The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their applications to compress pre-trained large language models (LLMs) for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.
comment: Accepted to EMNLP 2025
♻ ☆ Speech Enhancement and Dereverberation with Diffusion-based Generative Models
In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a different corpus than used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse.
comment: Proofread version
♻ ☆ A Set-Sequence Model for Time Series ICLR 2025
Many prediction problems across science and engineering, especially in finance and economics, involve large cross-sections of individual time series, where each unit (e.g., a loan, stock, or customer) is driven by unit-level features and latent cross-sectional dynamics. While sequence models have advanced per-unit temporal prediction, capturing cross-sectional effects often still relies on hand-crafted summary features. We propose Set-Sequence, a model that learns cross-sectional structure directly, enhancing expressivity and eliminating manual feature engineering. At each time step, a permutation-invariant Set module summarizes the unit set; a Sequence module then models each unit's dynamics conditioned on both its features and the learned summary. The architecture accommodates unaligned series, supports varying numbers of units at inference, integrates with standard sequence backbones (e.g., Transformers), and scales linearly in cross-sectional size. Across a synthetic contagion task and two large-scale real-world applications, equity portfolio optimization and loan risk prediction, Set-Sequence significantly outperforms strong baselines, delivering higher Sharpe ratios, improved AUCs, and interpretable cross-sectional summaries.
comment: Presented at the Workshop on Financial AI at ICLR 2025
♻ ☆ Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection
Camouflaged objects that blend into natural scenes pose significant challenges for deep-learning models to detect and synthesize. While camouflaged object detection is a crucial task in computer vision with diverse real-world applications, this research topic has been constrained by limited data availability. We propose a framework for synthesizing camouflage data to enhance the detection of camouflaged objects in natural scenes. Our approach employs a generative model to produce realistic camouflage images, which can be used to train existing object detection models. Specifically, we use a camouflage environment generator supervised by a camouflage distribution classifier to synthesize the camouflage images, which are then fed into our generator to expand the dataset. Our framework outperforms the current state-of-the-art method on three datasets (COD10k, CAMO, and CHAMELEON), demonstrating its effectiveness in improving camouflaged object detection. This approach can serve as a plug-and-play data generation and augmentation module for existing camouflaged object detection tasks and provides a novel way to introduce more diversity and distributions into current camouflage datasets.
♻ ☆ Codiscovering graphical structure and functional relationships within data: A Gaussian Process framework for connecting the dots
Most problems within and beyond the scientific domain can be framed into one of the following three levels of complexity of function approximation. Type 1: Approximate an unknown function given input/output data. Type 2: Consider a collection of variables and functions, some of which are unknown, indexed by the nodes and hyperedges of a hypergraph (a generalized graph where edges can connect more than two vertices). Given partial observations of the variables of the hypergraph (satisfying the functional dependencies imposed by its structure), approximate all the unobserved variables and unknown functions. Type 3: Expanding on Type 2, if the hypergraph structure itself is unknown, use partial observations of the variables of the hypergraph to discover its structure and approximate its unknown functions. These hypergraphs offer a natural platform for organizing, communicating, and processing computational knowledge. While most scientific problems can be framed as the data-driven discovery of unknown functions in a computational hypergraph whose structure is known (Type 2), many require the data-driven discovery of the structure (connectivity) of the hypergraph itself (Type 3). We introduce an interpretable Gaussian Process (GP) framework for such (Type 3) problems that does not require randomization of the data, access to or control over its sampling, or sparsity of the unknown functions in a known or learned basis. Its polynomial complexity, which contrasts sharply with the super-exponential complexity of causal inference methods, is enabled by the nonlinear ANOVA capabilities of GPs used as a sensing mechanism.
comment: PNAS (2024). This article was formerly titled: Computational Hypergraph Discovery, a Gaussian Process framework for connecting the dots. The code for the algorithm introduced in this paper and its application to various examples are available for download (and as as an installable python library/package) at https://github.com/TheoBourdais/ComputationalHypergraphDiscovery
Computer Vision and Pattern Recognition 98
☆ From Detection to Mitigation: Addressing Bias in Deep Learning Models for Chest X-Ray Diagnosis
Deep learning models have shown promise in improving diagnostic accuracy from chest X-rays, but they also risk perpetuating healthcare disparities when performance varies across demographic groups. In this work, we present a comprehensive bias detection and mitigation framework targeting sex, age, and race-based disparities when performing diagnostic tasks with chest X-rays. We extend a recent CNN-XGBoost pipeline to support multi-label classification and evaluate its performance across four medical conditions. We show that replacing the final layer of CNN with an eXtreme Gradient Boosting classifier improves the fairness of the subgroup while maintaining or improving the overall predictive performance. To validate its generalizability, we apply the method to different backbones, namely DenseNet-121 and ResNet-50, and achieve similarly strong performance and fairness outcomes, confirming its model-agnostic design. We further compare this lightweight adapter training method with traditional full-model training bias mitigation techniques, including adversarial training, reweighting, data augmentation, and active learning, and find that our approach offers competitive or superior bias reduction at a fraction of the computational cost. Finally, we show that combining eXtreme Gradient Boosting retraining with active learning yields the largest reduction in bias across all demographic subgroups, both in and out of distribution on the CheXpert and MIMIC datasets, establishing a practical and effective path toward equitable deep learning deployment in clinical radiology.
comment: Preprint of an article published in Pacific Symposium on Biocomputing \c{opyright} 2026 World Scientific Publishing Co., Singapore, http://psb.stanford.edu/
☆ MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation
Clouds remain a critical challenge in optical satellite imagery, hindering reliable analysis for environmental monitoring, land cover mapping, and climate research. To overcome this, we propose MSCloudCAM, a Cross-Attention with Multi-Scale Context Network tailored for multispectral and multi-sensor cloud segmentation. Our framework exploits the spectral richness of Sentinel-2 (CloudSEN12) and Landsat-8 (L8Biome) data to classify four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow. MSCloudCAM combines a Swin Transformer backbone for hierarchical feature extraction with multi-scale context modules ASPP and PSP for enhanced scale-aware learning. A Cross-Attention block enables effective multisensor and multispectral feature fusion, while the integration of an Efficient Channel Attention Block (ECAB) and a Spatial Attention Module adaptively refine feature representations. Comprehensive experiments on CloudSEN12 and L8Biome demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy, surpassing leading baseline architectures while maintaining competitive parameter efficiency and FLOPs. These results underscore the model's effectiveness and practicality, making it well-suited for large-scale Earth observation tasks and real-world applications.
comment: 7 pages, 2 Figures
☆ Full segmentation annotations of 3D time-lapse microscopy images of MDA231 cells
High-quality, publicly available segmentation annotations of image and video datasets are critical for advancing the field of image processing. In particular, annotations of volumetric images of a large number of targets are time-consuming and challenging. In (Melnikova, A., & Matula, P., 2025), we presented the first publicly available full 3D time-lapse segmentation annotations of migrating cells with complex dynamic shapes. Concretely, three distinct humans annotated two sequences of MDA231 human breast carcinoma cells (Fluo-C3DL-MDA231) from the Cell Tracking Challenge (CTC). This paper aims to provide a comprehensive description of the dataset and accompanying experiments that were not included in (Melnikova, A., & Matula, P., 2025) due to limitations in publication space. Namely, we show that the created annotations are consistent with the previously published tracking markers provided by the CTC organizers and the segmentation accuracy measured based on the 2D gold truth of CTC is within the inter-annotator variability margins. We compared the created 3D annotations with automatically created silver truth provided by CTC. We have found the proposed annotations better represent the complexity of the input images. The presented annotations can be used for testing and training cell segmentation, or analyzing 3D shapes of highly dynamic objects.
comment: 6 pages, 2 figures, 4 tables
☆ ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling ICCV 2025
Over the last years, 3D morphable models (3DMMs) have emerged as a state-of-the-art methodology for modeling and generating expressive 3D avatars. However, given their reliance on a strict topology, along with their linear nature, they struggle to represent complex full-head shapes. Following the advent of deep implicit functions, we propose imHead, a novel implicit 3DMM that not only models expressive 3D head avatars but also facilitates localized editing of the facial features. Previous methods directly divided the latent space into local components accompanied by an identity encoding to capture the global shape variations, leading to expensive latent sizes. In contrast, we retain a single compact identity space and introduce an intermediate region-specific latent representation to enable local edits. To train imHead, we curate a large-scale dataset of 4K distinct identities, making a step-towards large scale 3D head modeling. Under a series of experiments we demonstrate the expressive power of the proposed model to represent diverse identities and expressions outperforming previous approaches. Additionally, the proposed approach provides an interpretable solution for 3D face manipulation, allowing the user to make localized edits.
comment: ICCV 2025
☆ DISC-GAN: Disentangling Style and Content for Cluster-Specific Synthetic Underwater Image Generation
In this paper, we propose a novel framework, Disentangled Style-Content GAN (DISC-GAN), which integrates style-content disentanglement with a cluster-specific training strategy towards photorealistic underwater image synthesis. The quality of synthetic underwater images is challenged by optical due to phenomena such as color attenuation and turbidity. These phenomena are represented by distinct stylistic variations across different waterbodies, such as changes in tint and haze. While generative models are well-suited to capture complex patterns, they often lack the ability to model the non-uniform conditions of diverse underwater environments. To address these challenges, we employ K-means clustering to partition a dataset into style-specific domains. We use separate encoders to get latent spaces for style and content; we further integrate these latent representations via Adaptive Instance Normalization (AdaIN) and decode the result to produce the final synthetic image. The model is trained independently on each style cluster to preserve domain-specific characteristics. Our framework demonstrates state-of-the-art performance, obtaining a Structural Similarity Index (SSIM) of 0.9012, an average Peak Signal-to-Noise Ratio (PSNR) of 32.5118 dB, and a Frechet Inception Distance (FID) of 13.3728.
☆ Structured Spectral Graph Learning for Multi-label Abnormality Classification in 3D Chest CT Scans
With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.\\ This work extends our previous contribution presented at the MICCAI 2025 EMERGE Workshop.
comment: 22 pages, 14 figures
☆ EGD-YOLO: A Lightweight Multimodal Framework for Robust Drone-Bird Discrimination via Ghost-Enhanced YOLOv8n and EMA Attention under Adverse Condition
Identifying drones and birds correctly is essential for keeping the skies safe and improving security systems. Using the VIP CUP 2025 dataset, which provides both RGB and infrared (IR) images, this study presents EGD-YOLOv8n, a new lightweight yet powerful model for object detection. The model improves how image features are captured and understood, making detection more accurate and efficient. It uses smart design changes and attention layers to focus on important details while reducing the amount of computation needed. A special detection head helps the model adapt to objects of different shapes and sizes. We trained three versions: one using RGB images, one using IR images, and one combining both. The combined model achieved the best accuracy and reliability while running fast enough for real-time use on common GPUs.
☆ Optimally Deep Networks -- Adapting Model Depth to Datasets for Superior Efficiency
Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.
comment: 6 pages, 3 figures, 1 table
☆ Restricted Receptive Fields for Face Verification
Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model's actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28x28 within 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.
☆ Uncovering Anomalous Events for Marine Environmental Monitoring via Visual Anomaly Detection
Underwater video monitoring is a promising strategy for assessing marine biodiversity, but the vast volume of uneventful footage makes manual inspection highly impractical. In this work, we explore the use of visual anomaly detection (VAD) based on deep neural networks to automatically identify interesting or anomalous events. We introduce AURA, the first multi-annotator benchmark dataset for underwater VAD, and evaluate four VAD models across two marine scenes. We demonstrate the importance of robust frame selection strategies to extract meaningful video segments. Our comparison against multiple annotators reveals that VAD performance of current models varies dramatically and is highly sensitive to both the amount of training data and the variability in visual content that defines "normal" scenes. Our results highlight the value of soft and consensus labels and offer a practical approach for supporting scientific exploration and scalable biodiversity monitoring.
☆ Seeing My Future: Predicting Situated Interaction Behavior in Virtual Reality
Virtual and augmented reality systems increasingly demand intelligent adaptation to user behaviors for enhanced interaction experiences. Achieving this requires accurately understanding human intentions and predicting future situated behaviors - such as gaze direction and object interactions - which is vital for creating responsive VR/AR environments and applications like personalized assistants. However, accurate behavioral prediction demands modeling the underlying cognitive processes that drive human-environment interactions. In this work, we introduce a hierarchical, intention-aware framework that models human intentions and predicts detailed situated behaviors by leveraging cognitive mechanisms. Given historical human dynamics and the observation of scene contexts, our framework first identifies potential interaction targets and forecasts fine-grained future behaviors. We propose a dynamic Graph Convolutional Network (GCN) to effectively capture human-environment relationships. Extensive experiments on challenging real-world benchmarks and live VR environment demonstrate the effectiveness of our approach, achieving superior performance across all metrics and enabling practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.
comment: Project Page: https://xy02-05.github.io/Seeing_My_Future/
☆ WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting
We present WorldMirror, an all-in-one, feed-forward model for versatile 3D geometric prediction tasks. Unlike existing methods constrained to image-only inputs or customized for a specific task, our framework flexibly integrates diverse geometric priors, including camera poses, intrinsics, and depth maps, while simultaneously generating multiple 3D representations: dense point clouds, multi-view depth maps, camera parameters, surface normals, and 3D Gaussians. This elegant and unified architecture leverages available prior information to resolve structural ambiguities and delivers geometrically consistent 3D outputs in a single forward pass. WorldMirror achieves state-of-the-art performance across diverse benchmarks from camera, point map, depth, and surface normal estimation to novel view synthesis, while maintaining the efficiency of feed-forward inference. Code and models will be publicly available soon.
comment: Project page, code, and models will be publicly available soon
☆ VLM-Guided Adaptive Negative Prompting for Creative Generation
Creative generation is the synthesis of new, surprising, and valuable samples that reflect user intent yet cannot be envisioned in advance. This task aims to extend human imagination, enabling the discovery of visual concepts that exist in the unexplored spaces between familiar domains. While text-to-image diffusion models excel at rendering photorealistic scenes that faithfully match user prompts, they still struggle to generate genuinely novel content. Existing approaches to enhance generative creativity either rely on interpolation of image features, which restricts exploration to predefined categories, or require time-intensive procedures such as embedding optimization or model fine-tuning. We propose VLM-Guided Adaptive Negative-Prompting, a training-free, inference-time method that promotes creative image generation while preserving the validity of the generated object. Our approach utilizes a vision-language model (VLM) that analyzes intermediate outputs of the generation process and adaptively steers it away from conventional visual concepts, encouraging the emergence of novel and surprising outputs. We evaluate creativity through both novelty and validity, using statistical metrics in the CLIP embedding space. Through extensive experiments, we show consistent gains in creative novelty with negligible computational overhead. Moreover, unlike existing methods that primarily generate single objects, our approach extends to complex scenarios, such as generating coherent sets of creative objects and preserving creativity within elaborate compositional prompts. Our method integrates seamlessly into existing diffusion pipelines, offering a practical route to producing creative outputs that venture beyond the constraints of textual descriptions.
comment: Project page at: https://shelley-golan.github.io/VLM-Guided-Creative-Generation/
☆ Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos
This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code and trained model will be made publicly available.
☆ Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding
Action understanding, encompassing action detection and anticipation, plays a crucial role in numerous practical applications. However, untrimmed videos are often characterized by substantial redundant information and noise. Moreover, in modeling action understanding, the influence of the agent's intention on the action is often overlooked. Motivated by these issues, we propose a novel framework called the State-Specific Model (SSM), designed to unify and enhance both action detection and anticipation tasks. In the proposed framework, the Critical State-Based Memory Compression module compresses frame sequences into critical states, reducing information redundancy. The Action Pattern Learning module constructs a state-transition graph with multi-dimensional edges to model action dynamics in complex scenarios, on the basis of which potential future cues can be generated to represent intention. Furthermore, our Cross-Temporal Interaction module models the mutual influence between intentions and past as well as current information through cross-temporal interactions, thereby refining present and future features and ultimately realizing simultaneous action detection and anticipation. Extensive experiments on multiple benchmark datasets -- including EPIC-Kitchens-100, THUMOS'14, TVSeries, and the introduced Parkinson's Disease Mouse Behaviour (PDMB) dataset -- demonstrate the superior performance of our proposed framework compared to other state-of-the-art approaches. These results highlight the importance of action dynamics learning and cross-temporal interactions, laying a foundation for future action understanding research.
comment: 10 pages, 9 figures
☆ MSM-Seg: A Modality-and-Slice Memory Framework with Category-Agnostic Prompting for Multi-Modal Brain Tumor Segmentation
Multi-modal brain tumor segmentation is critical for clinical diagnosis, and it requires accurate identification of distinct internal anatomical subregions. While the recent prompt-based segmentation paradigms enable interactive experiences for clinicians, existing methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world scenarios. To address these issues, we propose a MSM-Seg framework for multi-modal brain tumor segmentation. The MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates multi-modal and inter-slice information with the efficient category-agnostic prompt for brain tumor understanding. To this end, we first devise a modality-and-slice memory attention (MSMA) to exploit the cross-modal and inter-slice relationships among the input scans. Then, we propose a multi-scale category-agnostic prompt encoder (MCP-Encoder) to provide tumor region guidance for decoding. Moreover, we devise a modality-adaptive fusion decoder (MF-Decoder) that leverages the complementary decoding information across different modalities to improve segmentation accuracy. Extensive experiments on different MRI datasets demonstrate that our MSM-Seg framework outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation. The code is available at https://github.com/xq141839/MSM-Seg.
comment: Under Review
☆ Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Image-Language Foundation Models (ILFM) have demonstrated remarkable success in image-text understanding/generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, succeeds in alleviating the substantial data and computational requirements associated with training video-language foundation models from scratch for video-text learning. This survey provides the first comprehensive review of this emerging field, which begins by summarizing the widely used ILFM and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories: frozen features and modified features, depending on whether the original representations from ILFM are preserved or undergo modifications. Building upon the task-specific nature of image-to-video transfer, this survey methodically elaborates these strategies and details their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis to investigate the efficacy of different image-to-video transfer learning paradigms on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive and structured overview, this survey aims to establish a structured roadmap for advancing video-text learning based on existing ILFM, and to inspire future research directions in this rapidly evolving domain.
comment: Draft version, work in progress
☆ AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws, indicating its potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning from given 4D scenes, since videos internally accompany dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction, in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditional generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is further introduced onto the pre-trained T2V model, by taking the generated video and 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work proves the potential of video generation models toward 4D interaction in real world.
☆ Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.
comment: 18 pages, 9 figures, project page: https://fsfm-3c.github.io/fsvfm.html
☆ Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping
As one of the fundamental modules in autonomous driving, online high-definition (HD) maps have attracted significant attention due to their cost-effectiveness and real-time capabilities. Since vehicles always cruise in highly dynamic environments, spatial displacement of onboard sensors inevitably causes shifts in real-time HD mapping results, and such instability poses fundamental challenges for downstream tasks. However, existing online map construction models tend to prioritize improving each frame's mapping accuracy, while the mapping stability has not yet been systematically studied. To fill this gap, this paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models and variants show that accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. We further analyze the impact of key model design choices on both criteria, identifying architectural and training factors that contribute to high accuracy, high stability, or both. To encourage broader focus on stability, we will release a public benchmark. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems. The benchmark toolkit, code, and models will be available at https://stablehdmap.github.io/.
☆ A Machine Learning Perspective on Automated Driving Corner Cases
For high-stakes applications, like autonomous driving, a safe operation is necessary to prevent harm, accidents, and failures. Traditionally, difficult scenarios have been categorized into corner cases and addressed individually. However, this example-based categorization is not scalable and lacks a data coverage perspective, neglecting the generalization to training data of machine learning models. In our work, we propose a novel machine learning approach that takes the underlying data distribution into account. Based on our novel perspective, we present a framework for effective corner case recognition for perception on individual samples. In our evaluation, we show that our approach (i) unifies existing scenario-based corner case taxonomies under a distributional perspective, (ii) achieves strong performance on corner case detection tasks across standard benchmarks for which we extend established out-of-distribution detection benchmarks, and (iii) enables analysis of combined corner cases via a newly introduced fog-augmented Lost & Found dataset. These results provide a principled basis for corner case recognition, underlining our manual specification-free definition.
☆ DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis
Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.
comment: 5 pages
☆ JND-Guided Light-Weight Neural Pre-Filter for Perceptual Image Coding
Just Noticeable Distortion (JND)-guided pre-filter is a promising technique for improving the perceptual compression efficiency of image coding. However, existing methods are often computationally expensive, and the field lacks standardized benchmarks for fair comparison. To address these challenges, this paper introduces a twofold contribution. First, we develop and open-source FJNDF-Pytorch, a unified benchmark for frequency-domain JND-Guided pre-filters. Second, leveraging this platform, we propose a complete learning framework for a novel, lightweight Convolutional Neural Network (CNN). Experimental results demonstrate that our proposed method achieves state-of-the-art compression efficiency, consistently outperforming competitors across multiple datasets and encoders. In terms of computational cost, our model is exceptionally lightweight, requiring only 7.15 GFLOPs to process a 1080p image, which is merely 14.1% of the cost of recent lightweight network. Our work presents a robust, state-of-the-art solution that excels in both performance and efficiency, supported by a reproducible research platform. The open-source implementation is available at https://github.com/viplab-fudan/FJNDF-Pytorch.
comment: 5 pages, 4 figures. Submitted to the IEEE International Symposium on Circuits and Systems (ISCAS) 2026
☆ GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus
Linear attention mechanisms have emerged as efficient alternatives to full self-attention in Graph Transformers, offering linear time complexity. However, existing linear attention models often suffer from a significant drop in expressiveness due to low-rank projection structures and overly uniform attention distributions. We theoretically prove that these properties reduce the class separability of node representations, limiting the model's classification ability. To address this, we propose a novel hybrid framework that enhances both the rank and focus of attention. Specifically, we enhance linear attention by attaching a gated local graph network branch to the value matrix, thereby increasing the rank of the resulting attention map. Furthermore, to alleviate the excessive smoothing effect inherent in linear attention, we introduce a learnable log-power function into the attention scores to reduce entropy and sharpen focus. We theoretically show that this function decreases entropy in the attention distribution, enhancing the separability of learned embeddings. Extensive experiments on both homophilic and heterophilic graph benchmarks demonstrate that our method achieves competitive performance while preserving the scalability of linear attention.
☆ ImpMIA: Leveraging Implicit Bias for Membership Inference Attack under Realistic Scenarios
Determining which data samples were used to train a model-known as Membership Inference Attack (MIA)-is a well-studied and important problem with implications for data privacy. Black-box methods presume access only to the model's outputs and often rely on training auxiliary reference models. While they have shown strong empirical performance, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. In this paper, we demonstrate that removing these assumptions leads to a significant drop in the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks, hence removes the need to rely on any reference models and their assumptions. ImpMIA is a white-box attack -- a setting which assumes access to model weights and is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). Building on maximum-margin implicit bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples. This is done by finding the samples whose gradients most strongly reconstruct the trained model's parameters. As a result, ImpMIA achieves state-of-the-art performance compared to both black and white box attacks in realistic settings where only the model weights and a superset of the training data are available.
☆ UltraScatter: Ray-Based Simulation of Ultrasound Scattering
Traditional ultrasound simulation methods solve wave equations numerically, achieving high accuracy but at substantial computational cost. Faster alternatives based on convolution with precomputed impulse responses remain relatively slow, often requiring several minutes to generate a full B-mode image. We introduce UltraScatter, a probabilistic ray tracing framework that models ultrasound scattering efficiently and realistically. Tissue is represented as a volumetric field of scattering probability and scattering amplitude, and ray interactions are simulated via free-flight delta tracking. Scattered rays are traced to the transducer, with phase information incorporated through a linear time-of-flight model. Integrated with plane-wave imaging and beamforming, our parallelized ray tracing architecture produces B-mode images within seconds. Validation with phantom data shows realistic speckle and inclusion patterns, positioning UltraScatter as a scalable alternative to wave-based methods.
comment: Accepted at IEEE IUS 2025
☆ OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment
Current visual evaluation approaches are typically constrained to a single task. To address this, we propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals for policy optimization. Inspired by subjective experiments, where participants are given task-specific instructions outlining distinct assessment principles prior to evaluation, we propose OmniQuality-R, a structured reward modeling framework that transforms multi-dimensional reasoning into continuous and interpretable reward signals. To enable this, we construct a reasoning-enhanced reward modeling dataset by sampling informative plan-reason trajectories via rejection sampling, forming a reliable chain-of-thought (CoT) dataset for supervised fine-tuning (SFT). Building on this, we apply Group Relative Policy Optimization (GRPO) for post-training, using a Gaussian-based reward to support continuous score prediction. To further stabilize the training and improve downstream generalization, we incorporate standard deviation (STD) filtering and entropy gating mechanisms during reinforcement learning. These techniques suppress unstable updates and reduce variance in policy optimization. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
☆ ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming both individual SFT, RLVR, and two-stage SFT \textrightarrow RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
☆ SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams
Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, which is a computational step not found in biological intelligence. This paper explores a fundamentally different, neuro-inspired paradigm for 6-DoF grasp detection. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, similarly to retinas, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud. To validate this approach, we built a large-scale synthetic benchmark dataset. Experiments show that SpikeGrasp surpasses traditional point-cloud-based baselines, especially in cluttered and textureless scenes, and demonstrates remarkable data efficiency. By establishing the viability of this end-to-end, neuro-inspired approach, SpikeGrasp paves the way for future systems capable of the fluid and efficient manipulation seen in nature, particularly for dynamic objects.
☆ A Simple and Better Baseline for Visual Grounding ICME2025
Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task with linguistic and visual modalities, there is a latest research line that focuses on only selecting the linguistic-relevant visual regions for object localization to reduce the computational overhead. Albeit achieving impressive performance, it is iteratively performed on different image scales, and at every iteration, linguistic features and visual features need to be stored in a cache, incurring extra overhead. To facilitate the implementation, in this paper, we propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and utilize the language in parallel as guidance to facilitate the interaction between linguistic modal and visual modal for extracting effective visual features. Furthermore, to reduce the computational cost, during the visual feature learning, we introduce a similarity-based feature selection mechanism to only exploit language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency beyond the current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
comment: ICME2025
☆ Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection
Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, significantly outperforming baseline methods.
☆ Injecting Frame-Event Complementary Fusion into Diffusion for Optical Flow in Challenging Scenes
Optical flow estimation has achieved promising results in conventional scenes but faces challenges in high-speed and low-light scenes, which suffer from motion blur and insufficient illumination. These conditions lead to weakened texture and amplified noise and deteriorate the appearance saturation and boundary completeness of frame cameras, which are necessary for motion feature matching. In degraded scenes, the frame camera provides dense appearance saturation but sparse boundary completeness due to its long imaging time and low dynamic range. In contrast, the event camera offers sparse appearance saturation, while its short imaging time and high dynamic range gives rise to dense boundary completeness. Traditionally, existing methods utilize feature fusion or domain adaptation to introduce event to improve boundary completeness. However, the appearance features are still deteriorated, which severely affects the mostly adopted discriminative models that learn the mapping from visual features to motion fields and generative models that generate motion fields based on given visual features. So we introduce diffusion models that learn the mapping from noising flow to clear flow, which is not affected by the deteriorated visual features. Therefore, we propose a novel optical flow estimation framework Diff-ABFlow based on diffusion models with frame-event appearance-boundary fusion.
☆ UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
Tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely UniFlow, by flexibly adapting any visual encoder with a concise reconstruction decoder. Specifically, we introduce layer-wise adaptive self-distillation applied to the well-pretrained visual encoders, which enables UniFlow to simultaneously inherit the strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight patch-wise pixel flow decoder, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. Extensive experiments across 13 challenging benchmarks spanning 7 widely studied visual understanding and generation tasks demonstrate that UniFlow achieves a win-win outcome. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 7.75% on average understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.
☆ Deep semi-supervised approach based on consistency regularization and similarity learning for weeds classification
Weed species classification represents an important step for the development of automated targeting systems that allow the adoption of precision agriculture practices. To reduce costs and yield losses caused by their presence. The identification of weeds is a challenging problem due to their shared similarities with crop plants and the variability related to the differences in terms of their types. Along with the variations in relation to changes in field conditions. Moreover, to fully benefit from deep learning-based methods, large fully annotated datasets are needed. This requires time intensive and laborious process for data labeling, which represents a limitation in agricultural applications. Hence, for the aim of improving the utilization of the unlabeled data, regarding conditions of scarcity in terms of the labeled data available during the learning phase and provide robust and high classification performance. We propose a deep semi-supervised approach, that combines consistency regularization with similarity learning. Through our developed deep auto-encoder architecture, experiments realized on the DeepWeeds dataset and inference in noisy conditions demonstrated the effectiveness and robustness of our method in comparison to state-of-the-art fully supervised deep learning models. Furthermore, we carried out ablation studies for an extended analysis of our proposed joint learning strategy.
comment: Submitted to EURASIP Journal on Image and Video Processing
☆ BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
comment: 6 pages, BabyLM Workshop, EMNLP 2025
☆ MRS-YOLO Railroad Transmission Line Foreign Object Detection Based on Improved YOLO11 and Channel Pruning
Aiming at the problems of missed detection, false detection and low detection efficiency in transmission line foreign object detection under railway environment, we proposed an improved algorithm MRS-YOLO based on YOLO11. Firstly, a multi-scale Adaptive Kernel Depth Feature Fusion (MAKDF) module is proposed and fused with the C3k2 module to form C3k2_MAKDF, which enhances the model's feature extraction capability for foreign objects of different sizes and shapes. Secondly, a novel Re-calibration Feature Fusion Pyramid Network (RCFPN) is designed as a neck structure to enhance the model's ability to integrate and utilize multi-level features effectively. Then, Spatial and Channel Reconstruction Detect Head (SC_Detect) based on spatial and channel preprocessing is designed to enhance the model's overall detection performance. Finally, the channel pruning technique is used to reduce the redundancy of the improved model, drastically reduce Parameters and Giga Floating Point Operations Per Second (GFLOPs), and improve the detection efficiency. The experimental results show that the mAP50 and mAP50:95 of the MRS-YOLO algorithm proposed in this paper are improved to 94.8% and 86.4%, respectively, which are 0.7 and 2.3 percentage points higher compared to the baseline, while Parameters and GFLOPs are reduced by 44.2% and 17.5%, respectively. It is demonstrated that the improved algorithm can be better applied to the task of foreign object detection in railroad transmission lines.
☆ GLOFNet -- A Multimodal Dataset for GLOF Monitoring and Prediction
Glacial Lake Outburst Floods (GLOFs) are rare but destructive hazards in high mountain regions, yet predictive research is hindered by fragmented and unimodal data. Most prior efforts emphasize post-event mapping, whereas forecasting requires harmonized datasets that combine visual indicators with physical precursors. We present GLOFNet, a multimodal dataset for GLOF monitoring and prediction, focused on the Shisper Glacier in the Karakoram. It integrates three complementary sources: Sentinel-2 multispectral imagery for spatial monitoring, NASA ITS_LIVE velocity products for glacier kinematics, and MODIS Land Surface Temperature records spanning over two decades. Preprocessing included cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding, followed by harmonization across modalities. Exploratory analysis reveals seasonal glacier velocity cycles, long-term warming of ~0.8 K per decade, and spatial heterogeneity in cryospheric conditions. The resulting dataset, GLOFNet, is publicly available to support future research in glacial hazard prediction. By addressing challenges such as class imbalance, cloud contamination, and coarse resolution, GLOFNet provides a structured foundation for benchmarking multimodal deep learning approaches to rare hazard prediction.
☆ MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates
Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The journal preprint version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.
comment: This is the accepted version of an article that has been published in \textbf{Pattern Recognition}. The final published version will be available soon
☆ Layout-Independent License Plate Recognition via Integrated Vision and Language Models
This work presents a pattern-aware framework for automatic license plate recognition (ALPR), designed to operate reliably across diverse plate layouts and challenging real-world conditions. The proposed system consists of a modern, high-precision detection network followed by a recognition stage that integrates a transformer-based vision model with an iterative language modelling mechanism. This unified recognition stage performs character identification and post-OCR refinement in a seamless process, learning the structural patterns and formatting rules specific to license plates without relying on explicit heuristic corrections or manual layout classification. Through this design, the system jointly optimizes visual and linguistic cues, enables iterative refinement to improve OCR accuracy under noise, distortion, and unconventional fonts, and achieves layout-independent recognition across multiple international datasets (IR-LPR, UFPR-ALPR, AOLP). Experimental results demonstrate superior accuracy and robustness compared to recent segmentation-free approaches, highlighting how embedding pattern analysis within the recognition stage bridges computer vision and language modelling for enhanced adaptability in intelligent transportation and surveillance applications.
☆ Unified Open-World Segmentation with Multi-Modal Prompts ICCV2025
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.
comment: Accepted to ICCV2025
☆ Receptive Field Expanded Look-Up Tables for Vision Inference: Advancing from Low-level to High-level Tasks
Recently, several look-up table (LUT) methods were developed to greatly expedite the inference of CNNs in a classical strategy of trading space for speed. However, these LUT methods suffer from a common drawback of limited receptive field of the convolution kernels due to the combinatorial explosion of table size. This research aims to expand the CNN receptive field with a fixed table size, thereby enhancing the performance of LUT-driven fast CNN inference while maintaining the same space complexity. To achieve this goal, various techniques are proposed. The main contribution is a novel approach of learning an optimal lattice vector quantizer that adaptively allocates the quantization resolution across data dimensions based on their significance to the inference task. In addition, the lattice vector quantizer offers an inherently more accurate approximation of CNN kernels than scalar quantizer as used in current practice. Furthermore, we introduce other receptive field expansion strategies, including irregular dilated convolutions and a U-shaped cascaded LUT structure, designed to capture multi-level contextual information without inflating table size. Together, these innovations allow our approach to effectively balance speed, accuracy, and memory efficiency, demonstrating significant improvements over existing LUT methods.
☆ VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
☆ SuperEx: Enhancing Indoor Mapping and Exploration using Non-Line-of-Sight Perception
Efficient exploration and mapping in unknown indoor environments is a fundamental challenge, with high stakes in time-critical settings. In current systems, robot perception remains confined to line-of-sight; occluded regions remain unknown until physically traversed, leading to inefficient exploration when layouts deviate from prior assumptions. In this work, we bring non-line-of-sight (NLOS) sensing to robotic exploration. We leverage single-photon LiDARs, which capture time-of-flight histograms that encode the presence of hidden objects - allowing robots to look around blind corners. Recent single-photon LiDARs have become practical and portable, enabling deployment beyond controlled lab settings. Prior NLOS works target 3D reconstruction in static, lab-based scenarios, and initial efforts toward NLOS-aided navigation consider simplified geometries. We introduce SuperEx, a framework that integrates NLOS sensing directly into the mapping-exploration loop. SuperEx augments global map prediction with beyond-line-of-sight cues by (i) carving empty NLOS regions from timing histograms and (ii) reconstructing occupied structure via a two-step physics-based and data-driven approach that leverages structural regularities. Evaluations on complex simulated maps and the real-world KTH Floorplan dataset show a 12% gain in mapping accuracy under < 30% coverage and improved exploration efficiency compared to line-of-sight baselines, opening a path to reliable mapping beyond direct visibility.
comment: 8 pages, 9 Figures , Project webpage: https://super-ex.github.io/
☆ Jigsaw3D: Disentangled 3D Style Transfer via Patch Shuffling and Masking
Controllable 3D style transfer seeks to restyle a 3D asset so that its textures match a reference image while preserving the integrity and multi-view consistency. The prevalent methods either rely on direct reference style token injection or score-distillation from 2D diffusion models, which incurs heavy per-scene optimization and often entangles style with semantic content. We introduce Jigsaw3D, a multi-view diffusion based pipeline that decouples style from content and enables fast, view-consistent stylization. Our key idea is to leverage the jigsaw operation - spatial shuffling and random masking of reference patches - to suppress object semantics and isolate stylistic statistics (color palettes, strokes, textures). We integrate these style cues into a multi-view diffusion model via reference-to-view cross-attention, producing view-consistent stylized renderings conditioned on the input mesh. The renders are then style-baked onto the surface to yield seamless textures. Across standard 3D stylization benchmarks, Jigsaw3D achieves high style fidelity and multi-view consistency with substantially lower latency, and generalizes to masked partial reference stylization, multi-object scene styling, and tileable texture generation. Project page is available at: https://babahui.github.io/jigsaw3D.github.io/
comment: 23 pages, 16 figures and 1 table
☆ Towards Efficient 3D Gaussian Human Avatar Compression: A Prior-Guided Framework
This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. The framework begins by training a canonical Gaussian avatar using articulated splatting in a network-free manner, which serves as the foundation for avatar appearance modeling. Simultaneously, a human-prior template is employed to capture temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy, enabling efficient compression: the canonical avatar is shared across the sequence, requiring compression only once, while the temporal parameters, consisting of just 94 parameters per frame, are transmitted with minimal bit-rate. For each frame, the target human avatar is generated by deforming canonical avatar via Linear Blend Skinning transformation, facilitating temporal coherent video reconstruction and novel view synthesis. Experimental results demonstrate that the proposed method significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression method in terms of rate-distortion performance on mainstream multi-view human video datasets, paving the way for seamless immersive multimedia experiences in meta-verse applications.
comment: 10 pages, 4 figures
☆ Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations such as fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE-rigid frequency allocation, axis-wise independence, and uniform head treatment-in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
☆ Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs) through supervised visual instruction tuning, using image-question-answer triplets. However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities, enabling them to generate high-quality supervised data without external inputs and thereby learn autonomously. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle: within the image-query-answer triangle, any masked elements should be consistently and accurately reconstructed. The framework involves three steps: (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning like image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and use the Triangular Consistency principle for filtering. (3) The model is further updated using the filtered synthetic data. To investigate the underlying mechanisms behind this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that the insights of this study on the self-refinement ability of VLMs can inspire future research on the learning mechanism of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.
☆ MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition
Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a design of motion-awareness, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.
☆ DAGLFNet:Deep Attention-Guided Global-Local Feature Fusion for Pseudo-Image Point Cloud Segmentation
Environmental perception systems play a critical role in high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor that provides accurate 3D point cloud data. How to efficiently process unstructured point clouds while extracting structured semantic information remains a significant challenge, and in recent years, numerous pseudo-image-based representation methods have emerged to achieve a balance between efficiency and performance. However, they often overlook the structural and semantic details of point clouds, resulting in limited feature fusion and discriminability. In this work, we propose DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. First, the Global-Local Feature Fusion Encoding module is used to enhance the correlation among local features within a set and capture global contextual information. Second, the Multi-Branch Feature Extraction network is employed to capture more neighborhood information and enhance the discriminability of contour features. Finally, a Feature Fusion via Deep Feature-guided Attention mechanism is introduced to improve the precision of cross-channel feature fusion. Experimental evaluations show that DAGLFNet achieves 69.83\% and 78.65\% on the validation sets of SemanticKITTI and nuScenes, respectively. The method balances high performance with real-time capability, demonstrating great potential for LiDAR-based real-time applications.
☆ When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance(CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLM's performance on hallucination-specific benchmarks and generalize effectively.
☆ Post-TIPS Prediction via Multimodal Interaction: A Multi-Center Dataset and Framework for Survival, Complication, and Portal Pressure Assessment
Transjugular intrahepatic portosystemic shunt (TIPS) is an established procedure for portal hypertension, but provides variable survival outcomes and frequent overt hepatic encephalopathy (OHE), indicating the necessity of accurate preoperative prognostic modeling. Current studies typically build machine learning models from preoperative CT images or clinical characteristics, but face three key challenges: (1) labor-intensive region-of-interest (ROI) annotation, (2) poor reliability and generalizability of unimodal methods, and (3) incomplete assessment from single-endpoint prediction. Moreover, the lack of publicly accessible datasets constrains research in this field. Therefore, we present MultiTIPS, the first public multi-center dataset for TIPS prognosis, and propose a novel multimodal prognostic framework based on it. The framework comprises three core modules: (1) dual-option segmentation, which integrates semi-supervised and foundation model-based pipelines to achieve robust ROI segmentation with limited annotations and facilitate subsequent feature extraction; (2) multimodal interaction, where three techniques, multi-grained radiomics attention (MGRA), progressive orthogonal disentanglement (POD), and clinically guided prognostic enhancement (CGPE), are introduced to enable cross-modal feature interaction and complementary representation integration, thus improving model accuracy and robustness; and (3) multi-task prediction, where a staged training strategy is used to perform stable optimization of survival, portal pressure gradient (PPG), and OHE prediction for comprehensive prognostic assessment. Extensive experiments on MultiTIPS demonstrate the superiority of the proposed method over state-of-the-art approaches, along with strong cross-domain generalization and interpretability, indicating its promise for clinical application. The dataset and code are available.
comment: 81 pages, 13 figures
☆ Learning from Disagreement: A Group Decision Simulation Framework for Robust Medical Image Segmentation
Medical image segmentation annotation suffers from inter-rater variability (IRV) due to differences in annotators' expertise and the inherent blurriness of medical images. Standard approaches that simply average expert labels are flawed, as they discard the valuable clinical uncertainty revealed in disagreements. We introduce a fundamentally new approach with our group decision simulation framework, which works by mimicking the collaborative decision-making process of a clinical panel. Under this framework, an Expert Signature Generator (ESG) learns to represent individual annotator styles in a unique latent space. A Simulated Consultation Module (SCM) then intelligently generates the final segmentation by sampling from this space. This method achieved state-of-the-art results on challenging CBCT and MRI datasets (92.11% and 90.72% Dice scores). By treating expert disagreement as a useful signal instead of noise, our work provides a clear path toward more robust and trustworthy AI systems for healthcare.
☆ On the Problem of Consistent Anomalies in Zero-Shot Industrial Anomaly Detection
Zero-shot image anomaly classification (AC) and segmentation (AS) are vital for industrial quality control, detecting defects without prior training data. Existing representation-based methods compare patch features with nearest neighbors in unlabeled test images but struggle with consistent anomalies -- similar defects recurring across multiple images -- resulting in poor AC/AS performance. We introduce Consistent-Anomaly Detection Graph (CoDeGraph), a novel algorithm that identifies and filters consistent anomalies from similarity computations. Our key insight is that normal patches in industrial images show stable, gradually increasing similarity to other test images, while consistent-anomaly patches exhibit abrupt similarity spikes after exhausting a limited set of similar matches, a phenomenon we term ``neighbor-burnout.'' CoDeGraph constructs an image-level graph, with images as nodes and edges connecting those with shared consistent-anomaly patterns, using community detection to filter these anomalies. We provide a theoretical foundation using Extreme Value Theory to explain the effectiveness of our approach. Experiments on MVTec AD with the ViT-L-14-336 backbone achieve 98.3% AUROC for AC and AS performance of 66.8% (+4.2%) F1 and 68.1% (+5.4%) AP over state-of-the-art zero-shot methods. Using the DINOv2 backbone further improves segmentation, yielding 69.1% (+6.5%) F1 and 71.9% (+9.2%) AP, demonstrating robustness across architectures.
comment: Published in TMLR (10/2025)
☆ MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation
We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses to noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into the process, ensuring the transformations remain within the camera field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network generalization capability. Furthermore, the reverse process iteratively predicts the poses by the denoising network and refines pose estimates by sampling from the diffusion posterior of current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scales, which guide the denoising network to achieve more accurate pose predictions. The reverse process demonstrates higher robustness than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, representing a 32.3% gain over the state-of-the-art.
☆ Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs
Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a ``what--where--reweight'' cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
comment: 12 pages, 5 figures
☆ Towards Cybersickness Severity Classification from VR Gameplay Videos Using Transfer Learning and Temporal Modeling
With the rapid advancement of virtual reality (VR) technology, its adoption across domains such as healthcare, education, and entertainment has grown significantly. However, the persistent issue of cybersickness, marked by symptoms resembling motion sickness, continues to hinder widespread acceptance of VR. While recent research has explored multimodal deep learning approaches leveraging data from integrated VR sensors like eye and head tracking, there remains limited investigation into the use of video-based features for predicting cybersickness. In this study, we address this gap by utilizing transfer learning to extract high-level visual features from VR gameplay videos using the InceptionV3 model pretrained on the ImageNet dataset. These features are then passed to a Long Short-Term Memory (LSTM) network to capture the temporal dynamics of the VR experience and predict cybersickness severity over time. Our approach effectively leverages the time-series nature of video data, achieving a 68.4% classification accuracy for cybersickness severity. This surpasses the performance of existing models trained solely on video data, providing a practical tool for VR developers to evaluate and mitigate cybersickness in virtual environments. Furthermore, this work lays the foundation for future research on video-based temporal modeling for enhancing user comfort in VR applications.
☆ Combo-Gait: Unified Transformer Framework for Multi-Modal Gait Recognition and Attribute Analysis
Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50{\deg}), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
Guided Image Feature Matching using Feature Spatial Order
Image feature matching plays a vital role in many computer vision tasks. Although many image feature detection and matching techniques have been proposed over the past few decades, it is still time-consuming to match feature points in two images, especially for images with a large number of detected features. Feature spatial order can estimate the probability that a pair of features is correct. Since it is a completely independent concept from epipolar geometry, it can be used to complement epipolar geometry in guiding feature match in a target region so as to improve matching efficiency. In this paper, we integrate the concept of feature spatial order into a progressive matching framework. We use some of the initially matched features to build a computational model of feature spatial order and employs it to calculates the possible spatial range of subsequent feature matches, thus filtering out unnecessary feature matches. We also integrate it with epipolar geometry to further improve matching efficiency and accuracy. Since the spatial order of feature points is affected by image rotation, we propose a suitable image alignment method from the fundamental matrix of epipolar geometry to remove the effect of image rotation. To verify the feasibility of the proposed method, we conduct a series of experiments, including a standard benchmark dataset, self-generated simulated images, and real images. The results demonstrate that our proposed method is significantly more efficient and has more accurate feature matching than the traditional method.
☆ Mesh-Gait: A Unified Framework for Gait Recognition Through Multi-Modal Representation Learning from 2D Silhouettes
Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. Compared to existing methods, directly learning 3D features from 3D joints or meshes is complex and difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.
☆ AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmark under visual-only settings.
comment: Project webpage: https://avocado-captioner.github.io/
☆ Identifying bias in CNN image classification using image scrambling and transforms
CNNs are now prevalent as the primary choice for most machine vision problems due to their superior rate of classification and the availability of user-friendly libraries. These networks effortlessly identify and select features in a non-intuitive data-driven manner, making it difficult to determine which features were most influential. That leads to a ``black box", where users cannot know how the image data are analyzed but rely on empirical results. Therefore the decision-making process can be biased by background information that is difficult to detect. Here we discuss examples of such hidden biases and propose techniques for identifying them, methods to distinguish between contextual information and background noise, and explore whether CNNs learn from irrelevant features. One effective approach to identify dataset bias is to classify blank background parts of the images. However, in some situations a blank background in the images is not available, making it more difficult to separate the foreground information from the blank background. Such parts of the image can also be considered contextual learning, not necessarily bias. To overcome this, we propose two approaches that were tested on six different datasets, including natural, synthetic, and hybrid datasets. The first method involves dividing images into smaller, non-overlapping tiles of various sizes, which are then shuffled randomly, making classification more challenging. The second method involves the application of several image transforms, including Fourier, Wavelet transforms, and Median filter, and their combinations. These transforms help recover background noise information used by CNN to classify images. Results indicate that this method can effectively distinguish between contextual information and background noise, and alert on the presence of background noise even without the need to use background information.
comment: 62 pages, Master's thesis
☆ Self-Supervised Multi-Scale Transformer with Attention-Guided Fusion for Efficient Crack Detection
Pavement crack detection has long depended on costly and time-intensive pixel-level annotations, which limit its scalability for large-scale infrastructure monitoring. To overcome this barrier, this paper examines the feasibility of achieving effective pixel-level crack segmentation entirely without manual annotations. Building on this objective, a fully self-supervised framework, Crack-Segmenter, is developed, integrating three complementary modules: the Scale-Adaptive Embedder (SAE) for robust multi-scale feature extraction, the Directional Attention Transformer (DAT) for maintaining linear crack continuity, and the Attention-Guided Fusion (AGF) module for adaptive feature integration. Through evaluations on ten public datasets, Crack-Segmenter consistently outperforms 13 state-of-the-art supervised methods across all major metrics, including mean Intersection over Union (mIoU), Dice score, XOR, and Hausdorff Distance (HD). These findings demonstrate that annotation-free crack detection is not only feasible but also superior, enabling transportation agencies and infrastructure managers to conduct scalable and cost-effective monitoring. This work advances self-supervised learning and motivates pavement cracks detection research.
comment: The paper has been published at Automation in Construction journal. The paper has 53 pages and 11 figures
♻ ☆ Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos
Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as the Gate Shift Module (GSM) or the Gate Shift Fuse to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal shifts and channel grouped spatial attention, enabling efficient modeling of both short and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight, plug-and-play module that integrates seamlessly with diverse 2D backbones. To further advance the field, we introduce the Table Tennis Australia dataset, the first PES benchmark for table tennis containing over 4,800 precisely annotated events. Extensive experiments across four PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.
♻ ☆ Streamlining Image Editing with Layered Diffusion Brushes
Denoising diffusion models have emerged as powerful tools for image manipulation, yet interactive, localized editing workflows remain underdeveloped. We introduce Layered Diffusion Brushes (LDB), a novel training-free framework that enables interactive, layer-based editing using standard diffusion models. LDB defines each "layer" as a self-contained set of parameters guiding the generative process, enabling independent, non-destructive, and fine-grained prompt-guided edits, even in overlapping regions. LDB leverages a unique intermediate latent caching approach to reduce each edit to only a few denoising steps, achieving 140~ms per edit on consumer GPUs. An editor implementing LDB, incorporating familiar layer concepts, was evaluated via user study and quantitative metrics. Results demonstrate LDB's superior speed alongside comparable or improved image quality, background preservation, and edit fidelity relative to state-of-the-art methods across various sequential image manipulation tasks. The findings highlight LDB's ability to significantly enhance creative workflows by providing an intuitive and efficient approach to diffusion-based image editing and its potential for expansion into related subdomains, such as video editing.
comment: arXiv admin note: text overlap with arXiv:2306.00219
♻ ☆ DA$^2$: Depth Anything in Any Direction
Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data has be released. Project page: https://depth-any-in-any-dir.github.io/.
comment: Work primarily done during an internship at Tencent Hunyuan. Project page: https://depth-any-in-any-dir.github.io/
♻ ☆ GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task aware reasoning. We propose a novel post training framework that incorporates task aware rewards to enable effective adaptation of reasoning based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state of the art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/ .
comment: Tables 6 and Figures 8. https://mustansarfiaz.github.io/GeoVLM-R1/
♻ ☆ LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement ICCV 2025
Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning, yet full parameter fine-tuning is often computationally prohibitive for large models. Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters. While LoRA allows for efficient fine-tuning, it requires significant data for adaptation, making Federated Learning (FL) an appealing solution due to its privacy-preserving collaborative framework. However, combining LoRA with FL introduces two key challenges: the \textbf{Server-Side Aggregation Bias}, where server-side averaging of LoRA matrices diverges from the ideal global update, and the \textbf{Client-Side Initialization Lag}, emphasizing the need for consistent initialization across rounds. Existing approaches address these challenges individually, limiting their effectiveness. We propose LoRA-FAIR, a novel method that tackles both issues by introducing a correction term on the server, enhancing aggregation efficiency and accuracy. LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods. Experimental results on ViT and MLP-Mixer models across large-scale datasets demonstrate that LoRA-FAIR consistently achieves performance improvements in FL settings.
comment: ICCV 2025, Code is available: https://github.com/jmbian/LoRA-FAIR
♻ ☆ AI-powered skin spectral imaging enables instant sepsis diagnosis and outcome prediction in critically ill patients
With sepsis remaining a leading cause of mortality, early identification of patients with sepsis and those at high risk of death is a challenge of high socioeconomic importance. Given the potential of hyperspectral imaging (HSI) to monitor microcirculatory alterations, we propose a deep learning approach to automated sepsis diagnosis and mortality prediction using a single HSI cube acquired within seconds. In a prospective observational study, we collected HSI data from the palms and fingers of more than 480 intensive care unit patients. Neural networks applied to HSI measurements predicted sepsis and mortality with areas under the receiver operating characteristic curve (AUROCs) of 0.80 and 0.72, respectively. Performance improved substantially with additional clinical data, reaching AUROCs of 0.94 for sepsis and 0.83 for mortality. We conclude that deep learning-based HSI analysis enables rapid and noninvasive prediction of sepsis and mortality, with a potential clinical value for enhancing diagnosis and treatment.
comment: Markus A. Weigand, Lena Maier-Hein and Maximilian Dietrich contributed equally
♻ ☆ Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection
The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods rely solely on softmax probabilities, overlooking the rich discriminative potential of the feature embeddings learned by VLMs trained on millions of samples. To address this limitation, we propose a novel context optimization (CoOp)-based framework that integrates subspace representation learning with prompt tuning. Our approach improves ID-OOD separability by projecting the ID features into a subspace spanned by prompt vectors, while projecting ID-irrelevant features into an orthogonal null space. To train such OOD detection framework, we design an easy-to-handle end-to-end learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy. Experiments on real-world datasets showcase the effectiveness of our approach.
♻ ☆ Graph Your Own Prompt NeurIPS 2025
We propose Graph Consistency Regularization (GCR), a novel framework that injects relational graph structures, derived from model predictions, into the learning process to promote class-aware, semantically meaningful feature representations. Functioning as a form of self-prompting, GCR enables the model to refine its internal structure using its own outputs. While deep networks learn rich representations, these often capture noisy inter-class similarities that contradict the model's predicted semantics. GCR addresses this issue by introducing parameter-free Graph Consistency Layers (GCLs) at arbitrary depths. Each GCL builds a batch-level feature similarity graph and aligns it with a global, class-aware masked prediction graph, derived by modulating softmax prediction similarities with intra-class indicators. This alignment enforces that feature-level relationships reflect class-consistent prediction behavior, acting as a semantic regularizer throughout the network. Unlike prior work, GCR introduces a multi-layer, cross-space graph alignment mechanism with adaptive weighting, where layer importance is learned from graph discrepancy magnitudes. This allows the model to prioritize semantically reliable layers and suppress noisy ones, enhancing feature quality without modifying the architecture or training procedure. GCR is model-agnostic, lightweight, and improves semantic structure across various networks and datasets. Experiments show that GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization, offering a new perspective on learning from prediction structure. [Project website](https://darcyddx.github.io/gcr/) [Code](https://github.com/Darcyddx/graph-prompt)
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.
♻ ☆ NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes
Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.
♻ ☆ A Unified Approach Towards Active Learning and Out-of-Distribution Detection
When applying deep learning models in open-world scenarios, active learning (AL) strategies are crucial for identifying label candidates from a nearly infinite amount of unlabeled data. In this context, robust out-of-distribution (OOD) detection mechanisms are essential for handling data outside the target distribution of the application. However, current works investigate both problems separately. In this work, we introduce SISOM as the first unified solution for both AL and OOD detection. By leveraging feature space distance metrics SISOM combines the strengths of the currently independent tasks to solve both effectively. We conduct extensive experiments showing the problems arising when migrating between both tasks. In these evaluations SISOM underlined its effectiveness by achieving first place in two of the widely used OpenOOD benchmarks and second place in the remaining one. In AL, SISOM outperforms others and delivers top-1 performance in three benchmarks
♻ ☆ Invariant Feature Learning for Generalized Long-Tailed Classification ECCV 2022
Existing long-tailed classification (LT) methods only focus on tackling the class-wise imbalance that head classes have more samples than tail classes, but overlook the attribute-wise imbalance. In fact, even if the class is balanced, samples within each class may still be long-tailed due to the varying attributes. Note that the latter is fundamentally more ubiquitous and challenging than the former because attributes are not just implicit for most datasets, but also combinatorially complex, thus prohibitively expensive to be balanced. Therefore, we introduce a novel research problem: Generalized Long-Tailed classification (GLT), to jointly consider both kinds of imbalances. By "generalized", we mean that a GLT method should naturally solve the traditional LT, but not vice versa. Not surprisingly, we find that most class-wise LT methods degenerate in our proposed two benchmarks: ImageNet-GLT and MSCOCO-GLT. We argue that it is because they over-emphasize the adjustment of class distribution while neglecting to learn attribute-invariant features. To this end, we propose an Invariant Feature Learning (IFL) method as the first strong baseline for GLT. IFL first discovers environments with divergent intra-class distributions from the imperfect predictions and then learns invariant features across them. Promisingly, as an improved feature backbone, IFL boosts all the LT line-up: one/two-stage re-balance, augmentation, and ensemble. Codes and benchmarks are available on Github: https://github.com/KaihuaTang/Generalized-Long-Tailed-Benchmarks.pytorch
comment: Accepted to ECCV 2022. Codes and benchmarks are available on Github: https://github.com/KaihuaTang/Generalized-Long-Tailed-Benchmarks.pytorch
♻ ☆ Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization ECCV 2022
Out-Of-Distribution generalization (OOD) is all about learning invariance against environmental changes. If the context in every class is evenly distributed, OOD would be trivial because the context can be easily removed due to an underlying principle: class is invariant to context. However, collecting such a balanced dataset is impractical. Learning on imbalanced data makes the model bias to context and thus hurts OOD. Therefore, the key to OOD is context balance. We argue that the widely adopted assumption in prior work, the context bias can be directly annotated or estimated from biased class prediction, renders the context incomplete or even incorrect. In contrast, we point out the everoverlooked other side of the above principle: context is also invariant to class, which motivates us to consider the classes (which are already labeled) as the varying environments to resolve context bias (without context labels). We implement this idea by minimizing the contrastive loss of intra-class sample similarity while assuring this similarity to be invariant across all classes. On benchmarks with various context biases and domain gaps, we show that a simple re-weighting based classifier equipped with our context estimation achieves state-of-the-art performance. We provide the theoretical justifications in Appendix and codes on https://github.com/simpleshinobu/IRMCon.
comment: Accepted by ECCV 2022
♻ ☆ Motion-Enhanced Nonlocal Similarity Implicit Neural Representation for Infrared Dim and Small Target Detection
Infrared dim and small target detection presents a significant challenge due to dynamic multi-frame scenarios and weak target signatures in the infrared modality. Traditional low-rank plus sparse models often fail to capture dynamic backgrounds and global spatial-temporal correlations, which results in background leakage or target loss. In this paper, we propose a novel motion-enhanced nonlocal similarity implicit neural representation (INR) framework to address these challenges. We first integrate motion estimation via optical flow to capture subtle target movements, and propose multi-frame fusion to enhance motion saliency. Second, we leverage nonlocal similarity to construct patch tensors with strong low-rank properties, and propose an innovative tensor decomposition-based INR model to represent the nonlocal patch tensor, effectively encoding both the nonlocal low-rankness and spatial-temporal correlations of background through continuous neural representations. An alternating direction method of multipliers is developed for the nonlocal INR model, which enjoys theoretical fixed-point convergence. Experimental results show that our approach robustly separates dim targets from complex infrared backgrounds, outperforming state-of-the-art methods in detection accuracy and robustness.
♻ ☆ Cell as Point: One-Stage Framework for Efficient Cell Tracking
Conventional multi-stage cell tracking approaches rely heavily on detection or segmentation in each frame as a prerequisite, requiring substantial resources for high-quality segmentation masks and increasing the overall prediction time. To address these limitations, we propose CAP, a novel end-to-end one-stage framework that reimagines cell tracking by treating Cell as Point. Unlike traditional methods, CAP eliminates the need for explicit detection or segmentation, instead jointly tracking cells for sequences in one stage by leveraging the inherent correlations among their trajectories. This simplification reduces both labeling requirements and pipeline complexity. However, directly processing the entire sequence in one stage poses challenges related to data imbalance in capturing cell division events and long sequence inference. To solve these challenges, CAP introduces two key innovations: (1) adaptive event-guided (AEG) sampling, which prioritizes cell division events to mitigate the occurrence imbalance of cell events, and (2) the rolling-as-window (RAW) inference strategy, which ensures continuous and stable tracking of newly emerging cells over extended sequences. By removing the dependency on segmentation-based preprocessing while addressing the challenges of imbalanced occurrence of cell events and long-sequence tracking, CAP demonstrates promising cell tracking performance and is 8 to 32 times more efficient than existing methods. The code and model checkpoints will be available soon.
comment: 17 pages, 8 figures, 6 tables
♻ ☆ A PDE-Based Image Dehazing Method via Atmospheric Scattering Theory
This paper introduces a novel partial differential equation (PDE) framework for single-image dehazing. We embed the atmospheric scattering model into a PDE featuring edge-preserving diffusion and a nonlocal operator to maintain both local details and global structures. A key innovation is an adaptive regularization mechanism guided by the dark channel prior, which adjusts smoothing strength based on haze density. The framework's mathematical well-posedness is rigorously established by proving the existence and uniqueness of its weak solution in $H_0^1(\Omega)$. An efficient, GPU-accelerated fixed-point solver is used for implementation. Experiments confirm our method achieves effective haze removal while preserving high image fidelity, offering a principled alternative to purely data-driven techniques.
comment: report
♻ ☆ SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams
Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.
♻ ☆ CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering
Current state-of-the-art 3D reconstruction models face limitations in building extra-large scale outdoor scenes, primarily due to the lack of sufficiently large-scale and detailed datasets. In this paper, we present a extra-large fine-grained dataset with 10 billion points composed of 41,006 drone-captured high-resolution aerial images, covering 20 diverse and culturally significant scenes from worldwide locations such as Cambridge Uni main buildings, the Pyramids, and the Forbidden City Palace. Compared to existing datasets, ours offers significantly larger scale and higher detail, uniquely suited for fine-grained 3D applications. Each scene contains an accurate spatial layout and comprehensive structural information, supporting detailed 3D reconstruction tasks. By reconstructing environments using these detailed images, our dataset supports multiple applications, including outputs in the widely adopted COLMAP format, establishing a novel benchmark for evaluating state-of-the-art large-scale Gaussian Splatting methods.The dataset's flexibility encourages innovations and supports model plug-ins, paving the way for future 3D breakthroughs. All datasets and code will be open-sourced for community use.
♻ ☆ Concept Retrieval -- What and How?
A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/
♻ ☆ VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models
Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards.
♻ ☆ xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition
Recognising expressive behaviours in face videos is a long-standing challenge in Affective Computing. Despite significant advancements in recent years, it still remains a challenge to build a robust and reliable system for naturalistic and in-the-wild facial expressive behaviour analysis in real time. This paper addresses two key challenges in building such a system: (1). The paucity of large-scale labelled facial affect video datasets with extensive coverage of the 2D emotion space, and (2). The difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. Toward addressing these challenges, this work introduces xTrace, a robust tool for facial expressive behaviour analysis and predicting continuous values of dimensional emotions, namely valence and arousal, from in-the-wild face videos. To address challenge (1), the proposed affect recognition model is trained on the largest facial affect video data set, containing $\sim$450k videos that cover most emotion zones in the dimensional emotion space, making xTrace highly versatile in analysing a wide spectrum of naturalistic expressive behaviours. To address challenge (2), xTrace uses facial affect descriptors that are not only explainable, but can also achieve a high degree of accuracy and robustness with low computational complexity. The key components of xTrace are benchmarked against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox. On an in-the-wild benchmarking set composed of $\sim$50k videos, xTrace achieves 0.86 mean Concordance Correlation Coefficient (CCC) and on the SEWA test set it achieves 0.75 mean CCC, outperforming existing SOTA by $\sim$7.1\%.
♻ ☆ VORTA: Efficient Video Diffusion via Routing Sparse Attention NeurIPS 2025
Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods enhance the efficiency by exploiting the local sparsity of attention scores; yet they often struggle with accelerating the long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves an end-to-end speedup $1.76\times$ without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to speedup $14.41\times$ with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Codes and weights are available at https://github.com/wenhao728/VORTA.
comment: Accepted by NeurIPS 2025. The code is available at https://github.com/wenhao728/VORTA
♻ ☆ Shifting AI Efficiency From Model-Centric to Data-Centric Compression
The advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on scaling model parameters. However, as hardware limits constrain further model growth, the primary computational bottleneck has shifted to the quadratic cost of self-attention over increasingly long sequences by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, \textbf{we argue that the focus of research for efficient artificial intelligence (AI) is shifting from model-centric compression to data-centric compression}. We position data-centric compression as the emerging paradigm, which improves AI efficiency by directly compressing the volume of data processed during model training or inference. To formalize this shift, we establish a unified framework for existing efficiency strategies and demonstrate why it constitutes a crucial paradigm change for long-context AI. We then systematically review the landscape of data-centric compression methods, analyzing their benefits across diverse scenarios. Finally, we outline key challenges and promising future research directions. Our work aims to provide a novel perspective on AI efficiency, synthesize existing efforts, and catalyze innovation to address the challenges posed by ever-increasing context lengths.
comment: Project: \url{https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression}
♻ ☆ Class-Invariant Test-Time Augmentation for Domain Generalization
Deep models often suffer significant performance degradation under distribution shifts. Domain generalization (DG) seeks to mitigate this challenge by enabling models to generalize to unseen domains. Most prior approaches rely on multi-domain training or computationally intensive test-time adaptation. In contrast, we propose a complementary strategy: lightweight test-time augmentation. Specifically, we develop a novel Class-Invariant Test-Time Augmentation (CI-TTA) technique. The idea is to generate multiple variants of each input image through elastic and grid deformations that nevertheless belong to the same class as the original input. Their predictions are aggregated through a confidence-guided filtering scheme that remove unreliable outputs, ensuring the final decision relies on consistent and trustworthy cues. Extensive Experiments on PACS and Office-Home datasets demonstrate consistent gains across different DG algorithms and backbones, highlighting the effectiveness and generality of our approach.
♻ ☆ DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy NeurIPS2025
Garment manipulation is a critical challenge due to the diversity in garment categories, geometries, and deformations. Despite this, humans can effortlessly handle garments, thanks to the dexterity of our hands. However, existing research in the field has struggled to replicate this level of dexterity, primarily hindered by the lack of realistic simulations of dexterous garment manipulation. Therefore, we propose DexGarmentLab, the first environment specifically designed for dexterous (especially bimanual) garment manipulation, which features large-scale high-quality 3D assets for 15 task scenarios, and refines simulation techniques tailored for garment modeling to reduce the sim-to-real gap. Previous data collection typically relies on teleoperation or training expert reinforcement learning (RL) policies, which are labor-intensive and inefficient. In this paper, we leverage garment structural correspondence to automatically generate a dataset with diverse trajectories using only a single expert demonstration, significantly reducing manual intervention. However, even extensive demonstrations cannot cover the infinite states of garments, which necessitates the exploration of new algorithms. To improve generalization across diverse garment shapes and deformations, we propose a Hierarchical gArment-manipuLation pOlicy (HALO). It first identifies transferable affordance points to accurately locate the manipulation area, then generates generalizable trajectories to complete the task. Through extensive experiments and detailed analysis of our method and baseline, we demonstrate that HALO consistently outperforms existing methods, successfully generalizing to previously unseen instances even with significant variations in shape and deformation where others fail. Our project page is available at: https://wayrise.github.io/DexGarmentLab/.
comment: NeurIPS2025 Spotlight
♻ ☆ FSFSplatter: Build Surface and Novel Views with Sparse-Views within 2min
Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surface due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on widely used DTU, Replica, and BlendedMVS datasets.
♻ ☆ Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations
Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style) -- all during test time. Through extensive experiments, we demonstrate that our approach is very simple, requires no retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim5}$x faster than the current state-of-the-art. Code is available at: https://github.com/kim-dahye/steerers
comment: 23 pages, 18 figures
♻ ☆ Tokenizing Motion: A Generative Approach for Scene Dynamics Compression
This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion-performance and outperform the state-of-the-art conventional-video codec Enhanced Compression Model (ECM) on-scene dynamics sequences. The project page can be found at-https://github.com/xyzysz/GNVDC.
comment: 5page, 5 figures
♻ ☆ Evaluating YOLO Architectures: Implications for Real-Time Vehicle Detection in Urban Environments of Bangladesh
Vehicle detection systems trained on Non-Bangladeshi datasets struggle to accurately identify local vehicle types in Bangladesh's unique road environments, creating critical gaps in autonomous driving technology for developing regions. This study evaluates six YOLO model variants on a custom dataset featuring 29 distinct vehicle classes, including region-specific vehicles such as ``Desi Nosimon'', ``Leguna'', ``Battery Rickshaw'', and ``CNG''. The dataset comprises high-resolution images (1920x1080) captured across various Bangladeshi roads using mobile phone cameras and manually annotated using LabelImg with YOLO format bounding boxes. Performance evaluation revealed YOLOv11x as the top performer, achieving 63.7\% mAP@0.5, 43.8\% mAP@0.5:0.95, 61.4\% recall, and 61.6\% F1-score, though requiring 45.8 milliseconds per image for inference. Medium variants (YOLOv8m, YOLOv11m) struck an optimal balance, delivering robust detection performance with mAP@0.5 values of 62.5\% and 61.8\% respectively, while maintaining moderate inference times around 14-15 milliseconds. The study identified significant detection challenges for rare vehicle classes, with Construction Vehicles and Desi Nosimons showing near-zero accuracy due to dataset imbalances and insufficient training samples. Confusion matrices revealed frequent misclassifications between visually similar vehicles, particularly Mini Trucks versus Mini Covered Vans. This research provides a foundation for developing robust object detection systems specifically adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology advancement for developing regions where conventional generic-trained models fail to perform adequately.
♻ ☆ UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning ACM MM 2025
Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention due to their enhanced accuracy and robustness under challenging conditions including low-illumination and adverse weather. However, due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adpater to effectively complement the ViT features with the contextual multi-scale features. During training process, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize the ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
comment: 10 pages, 6 figures, Accepted by ACM MM 2025
♻ ☆ Boosting Adversarial Transferability via Commonality-Oriented Gradient Optimization
Exploring effective and transferable adversarial examples is vital for understanding the characteristics and mechanisms of Vision Transformers (ViTs). However, adversarial examples generated from surrogate models often exhibit weak transferability in black-box settings due to overfitting. Existing methods improve transferability by diversifying perturbation inputs or applying uniform gradient regularization within surrogate models, yet they have not fully leveraged the shared and unique features of surrogate models trained on the same task, leading to suboptimal transfer performance. Therefore, enhancing perturbations of common information shared by surrogate models and suppressing those tied to individual characteristics offers an effective way to improve transferability. Accordingly, we propose a commonality-oriented gradient optimization strategy (COGO) consisting of two components: Commonality Enhancement (CE) and Individuality Suppression (IS). CE perturbs the mid-to-low frequency regions, leveraging the fact that ViTs trained on the same dataset tend to rely more on mid-to-low frequency information for classification. IS employs adaptive thresholds to evaluate the correlation between backpropagated gradients and model individuality, assigning weights to gradients accordingly. Extensive experiments demonstrate that COGO significantly improves the transfer success rates of adversarial attacks, outperforming current state-of-the-art methods.
comment: 23 pages
♻ ☆ Free-Lunch Color-Texture Disentanglement for Stylized Image Generation NeurIPS2025
Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.Code is released at https://deepffff.github.io/sadis.github.io/.
comment: Accepted by NeurIPS2025. Code is available at https://deepffff.github.io/sadis.github.io/
♻ ☆ FairDD: Fair Dataset Distillation NIPS2025
Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation fails to alleviate the unfairness towards minority groups within original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge the research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches (DDs), requiring no modifications to their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of original datasets, rather than indiscriminate alignment to the whole distributions in vanilla DDs, dominated by majority groups. This synchronized matching allows synthetic datasets to avoid collapsing into majority groups and bootstrap their balanced generation to all PA groups. Consequently, FairDD could effectively regularize vanilla DDs to favor biased generation toward minority groups while maintaining the accuracy of target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DDs, with a promising trade-off between fairness and accuracy. Its consistent superiority across diverse DDs, spanning Distribution and Gradient Matching, establishes it as a versatile FDD approach. Code is available at https://github.com/zqhang/FairDD.
comment: Accepted by NIPS2025
♻ ☆ Tversky Neural Networks: Psychologically Plausible Deep Learning with Differentiable Tversky Similarity
Work in psychology has highlighted that the geometric model of similarity standard in deep learning is not psychologically plausible because its metric properties such as symmetry do not align with human perception of similarity. In contrast, Tversky (1977) proposed an axiomatic theory of similarity with psychological plausibility based on a representation of objects as sets of features, and their similarity as a function of their common and distinctive features. This model of similarity has not been used in deep learning before, in part because of the challenge of incorporating discrete set operations. In this paper, we develop a differentiable parameterization of Tversky's similarity that is learnable through gradient descent, and derive basic neural network building blocks such as the Tversky projection layer, which unlike the linear projection layer can model non-linear functions such as XOR. Through experiments with image recognition and language modeling neural networks, we show that the Tversky projection layer is a beneficial replacement for the linear projection layer. For instance, on the NABirds image classification task, a frozen ResNet-50 adapted with a Tversky projection layer achieves a 24.7% relative accuracy improvement over the linear layer adapter baseline. With Tversky projection layers, GPT-2's perplexity on PTB decreases by 7.8%, and its parameter count by 34.8%. Finally, we propose a unified interpretation of both types of projection layers as computing similarities of input stimuli to learned prototypes for which we also propose a novel visualization technique highlighting the interpretability of Tversky projection layers. Our work offers a new paradigm for thinking about the similarity model implicit in modern deep learning, and designing neural networks that are interpretable under an established theory of psychological similarity.
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ Demand Estimation with Text and Image Data
We propose a demand estimation method that leverages unstructured text and image data to infer substitution patterns. Using pre-trained deep learning models, we extract embeddings from product images and textual descriptions and incorporate them into a random coefficients logit model. This approach enables researchers to estimate demand even when they lack data on product attributes or when consumers value hard-to-quantify attributes, such as visual design or functional benefits. Using data from a choice experiment, we show that our approach outperforms standard attribute-based models in counterfactual predictions of consumers' second choices. We also apply it across 40 product categories on Amazon and consistently find that text and image data help identify close substitutes within each category.
Machine Learning 69
☆ Discrete State Diffusion Models: A Sample Complexity Perspective
Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume score estimation error bounds without studying sample complexity results. In this work, we present a principled theoretical framework for discrete-state diffusion, providing the first sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-2})$. Our structured decomposition of the score estimation error into statistical, approximation, optimization, and clipping components offers critical insights into how discrete-state models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of discrete-state diffusion models.
☆ Glance for Context: Learning When to Leverage LLMs for Node-Aware GNN-LLM Fusion
Learning on text-attributed graphs has motivated the use of Large Language Models (LLMs) for graph learning. However, most fusion strategies are applied uniformly across all nodes and attain only small overall performance gains. We argue this result stems from aggregate metrics that obscure when LLMs provide benefit, inhibiting actionable signals for new strategies. In this work, we reframe LLM-GNN fusion around nodes where GNNs typically falter. We first show that performance can significantly differ between GNNs and LLMs, with each excelling on distinct structural patterns, such as local homophily. To leverage this finding, we propose GLANCE (GNN with LLM Assistance for Neighbor- and Context-aware Embeddings), a framework that invokes an LLM to refine a GNN's prediction. GLANCE employs a lightweight router that, given inexpensive per-node signals, decides whether to query the LLM. Since the LLM calls are non-differentiable, the router is trained with an advantage-based objective that compares the utility of querying the LLM against relying solely on the GNN. Across multiple benchmarks, GLANCE achieves the best performance balance across node subgroups, achieving significant gains on heterophilous nodes (up to $+13\%$) while simultaneously achieving top overall performance. Our findings highlight the value of adaptive, node-aware GNN-LLM architectures, where selectively invoking the LLM enables scalable deployment on large graphs without incurring high computational costs.
Fast and the Furious: Hot Starts in Pursuit-Evasion Games AAMAS
Effectively positioning pursuers in pursuit-evasion games without prior knowledge of evader locations remains a significant challenge. A novel approach that combines game-theoretic control theory with Graph Neural Networks is introduced in this work. By conceptualizing pursuer configurations as strategic arrangements and representing them as graphs, a Graph Characteristic Space is constructed via multi-objective optimization to identify Pareto-optimal configurations. A Graph Convolutional Network (GCN) is trained on these Pareto-optimal graphs to generate strategically effective initial configurations, termed "hot starts". Empirical evaluations demonstrate that the GCN-generated hot starts provide a significant advantage over random configurations. In scenarios considering multiple pursuers and evaders, this method hastens the decline in evader survival rates, reduces pursuer travel distances, and enhances containment, showcasing clear strategic benefits.
comment: Presented at AAMAS Workshop on Autonomous Robots and Multirobot Systems (ARMS)
☆ Aegis: A Correlation-Based Data Masking Advisor for Data Sharing Ecosystems
Data-sharing ecosystems enable entities -- such as providers, consumers, and intermediaries -- to access, exchange, and utilize data for various downstream tasks and applications. Due to privacy concerns, data providers typically anonymize datasets before sharing them; however, the existence of multiple masking configurations results in masked datasets with varying utility. Consequently, a key challenge lies in efficiently determining the optimal masking configuration that maximizes a dataset's utility. This paper presents AEGIS, a middleware framework for identifying the optimal masking configuration for machine learning datasets that consist of features and a class label. We introduce a utility optimizer that minimizes predictive utility deviation -- a metric based on the changes in feature-label correlations before and after masking. Our framework leverages limited data summaries (such as 1D histograms) or none to estimate the feature-label joint distribution, making it suitable for scenarios where raw data is inaccessible due to privacy restrictions. To achieve this, we propose a joint distribution estimator based on iterative proportional fitting, which allows supporting various feature-label correlation quantification methods such as g3, mutual information, or chi-square. Our experimental evaluation on real-world datasets shows that AEGIS identifies optimal masking configurations over an order of magnitude faster, while the resulting masked datasets achieve predictive performance on downstream ML tasks that is on par with baseline approaches.
comment: Accepted at SIGMOD 2026
☆ Crisis-Aware Regime-Conditioned Diffusion with CVaR Allocation
We study whether regime-conditioned generative scenarios, coupled with a convex CVaR allocator, improve portfolio decisions under regime shifts. We introduce Multi-Agent Regime-Conditioned Diffusion (MARCD), which (i) infers latent regimes via a Gaussian HMM, (ii) trains a diffusion model with a tail-weighted objective and a regime-specialized mixture-of-experts (MoE) denoiser to enrich crisis co-movements, and (iii) feeds the generated scenarios into a turnover-aware CVaR epigraph quadratic program with explicit governance. In strict walk-forward tests on liquid multi-asset ETFs (2005-2025), MARCD outperforms standard allocators and improves calibration relative to popular generators. Over 2020-2025 out-of-sample (monthly; 10 bps), MARCD attains Sharpe 1.23 (BL 1.02) and MaxDD 9.3 percent (BL 14.1 percent), a 34 percent reduction, at comparable turnover; stationary block-bootstrap intervals indicate the Sharpe uplift is significant at 5 percent. We provide theory linking tail-weighted diffusion to spectral-risk control of the decision-relevant CVaR gap, oracle/consistency results for the regime-MoE denoiser, and Lipschitz/regret guarantees for the allocator. Together, MARCD offers a reproducible bridge from tail-faithful scenario modeling to governed portfolio decisions with materially improved drawdown control.
comment: Code available at: https://github.com/AliAtiah/MARCD
☆ Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures
Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
comment: Waiting for Conference Response
☆ MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation
Clouds remain a critical challenge in optical satellite imagery, hindering reliable analysis for environmental monitoring, land cover mapping, and climate research. To overcome this, we propose MSCloudCAM, a Cross-Attention with Multi-Scale Context Network tailored for multispectral and multi-sensor cloud segmentation. Our framework exploits the spectral richness of Sentinel-2 (CloudSEN12) and Landsat-8 (L8Biome) data to classify four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow. MSCloudCAM combines a Swin Transformer backbone for hierarchical feature extraction with multi-scale context modules ASPP and PSP for enhanced scale-aware learning. A Cross-Attention block enables effective multisensor and multispectral feature fusion, while the integration of an Efficient Channel Attention Block (ECAB) and a Spatial Attention Module adaptively refine feature representations. Comprehensive experiments on CloudSEN12 and L8Biome demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy, surpassing leading baseline architectures while maintaining competitive parameter efficiency and FLOPs. These results underscore the model's effectiveness and practicality, making it well-suited for large-scale Earth observation tasks and real-world applications.
comment: 7 pages, 2 Figures
☆ PruneGCRN: Minimizing and explaining spatio-temporal problems through node pruning
This work addresses the challenge of using a deep learning model to prune graphs and the ability of this method to integrate explainability into spatio-temporal problems through a new approach. Instead of applying explainability to the model's behavior, we seek to gain a better understanding of the problem itself. To this end, we propose a novel model that integrates an optimized pruning mechanism capable of removing nodes from the graph during the training process, rather than doing so as a separate procedure. This integration allows the architecture to learn how to minimize prediction error while selecting the most relevant nodes. Thus, during training, the model searches for the most relevant subset of nodes, obtaining the most important elements of the problem, facilitating its analysis. To evaluate the proposed approach, we used several widely used traffic datasets, comparing the accuracy obtained by pruning with the model and with other methods. The experiments demonstrate that our method is capable of retaining a greater amount of information as the graph reduces in size compared to the other methods used. These results highlight the potential of pruning as a tool for developing models capable of simplifying spatio-temporal problems, thereby obtaining their most important elements.
☆ Rethinking deep learning: linear regression remains a key benchmark in predicting terrestrial water storage
Recent advances in machine learning such as Long Short-Term Memory (LSTM) models and Transformers have been widely adopted in hydrological applications, demonstrating impressive performance amongst deep learning models and outperforming physical models in various tasks. However, their superiority in predicting land surface states such as terrestrial water storage (TWS) that are dominated by many factors such as natural variability and human driven modifications remains unclear. Here, using the open-access, globally representative HydroGlobe dataset - comprising a baseline version derived solely from a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation - we show that linear regression is a robust benchmark, outperforming the more complex LSTM and Temporal Fusion Transformer for TWS prediction. Our findings highlight the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models. Additionally, we emphasize the critical need to establish globally representative benchmark datasets that capture the combined impact of natural variability and human interventions.
☆ BioOSS: A Bio-Inspired Oscillatory State System with Spatio-Temporal Dynamics
Today's deep learning architectures are primarily based on perceptron models, which do not capture the oscillatory dynamics characteristic of biological neurons. Although oscillatory systems have recently gained attention for their closer resemblance to neural behavior, they still fall short of modeling the intricate spatio-temporal interactions observed in natural neural circuits. In this paper, we propose a bio-inspired oscillatory state system (BioOSS) designed to emulate the wave-like propagation dynamics critical to neural processing, particularly in the prefrontal cortex (PFC), where complex activity patterns emerge. BioOSS comprises two interacting populations of neurons: p neurons, which represent simplified membrane-potential-like units inspired by pyramidal cells in cortical columns, and o neurons, which govern propagation velocities and modulate the lateral spread of activity. Through local interactions, these neurons produce wave-like propagation patterns. The model incorporates trainable parameters for damping and propagation speed, enabling flexible adaptation to task-specific spatio-temporal structures. We evaluate BioOSS on both synthetic and real-world tasks, demonstrating superior performance and enhanced interpretability compared to alternative architectures.
☆ Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent
comment: 22 pages, 2 figures, 8 tables
☆ Structure Over Signal: A Globalized Approach to Multi-relational GNNs for Stock Prediction
In financial markets, Graph Neural Networks have been successfully applied to modeling relational data, effectively capturing nonlinear inter-stock dependencies. Yet, existing models often fail to efficiently propagate messages during macroeconomic shocks. In this paper, we propose OmniGNN, an attention-based multi-relational dynamic GNN that integrates macroeconomic context via heterogeneous node and edge types for robust message passing. Central to OmniGNN is a sector node acting as a global intermediary, enabling rapid shock propagation across the graph without relying on long-range multi-hop diffusion. The model leverages Graph Attention Networks (GAT) to weigh neighbor contributions and employs Transformers to capture temporal dynamics across multiplex relations. Experiments show that OmniGNN outperforms existing stock prediction models on public datasets, particularly demonstrating strong robustness during the COVID-19 period.
☆ ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Persian Language, despite being spoken by over 100 million people worldwide, remains severely underrepresented in high-quality speech corpora, particularly for text-to-speech (TTS) synthesis applications. Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for TTS applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and multi-dimensional quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies and to serve as a template for other low-resource languages. The ParsVoice dataset is publicly available at ParsVoice (https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice).
☆ Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our findings through large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
☆ Optimally Deep Networks -- Adapting Model Depth to Datasets for Superior Efficiency
Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.
comment: 6 pages, 3 figures, 1 table
☆ How Patterns Dictate Learnability in Sequential Data NeurIPS 2025
Sequential data - ranging from financial time series to natural language - has driven the growing adoption of autoregressive models. However, these algorithms rely on the presence of underlying patterns in the data, and their identification often depends heavily on human expertise. Misinterpreting these patterns can lead to model misspecification, resulting in increased generalization error and degraded performance. The recently proposed evolving pattern (EvoRate) metric addresses this by using the mutual information between the next data point and its past to guide regression order estimation and feature selection. Building on this idea, we introduce a general framework based on predictive information, defined as the mutual information between the past and the future, $I(X_{past}; X_{future})$. This quantity naturally defines an information-theoretic learning curve, which quantifies the amount of predictive information available as the observation window grows. Using this formalism, we show that the presence or absence of temporal patterns fundamentally constrains the learnability of sequential models: even an optimal predictor cannot outperform the intrinsic information limit imposed by the data. We validate our framework through experiments on synthetic data, demonstrating its ability to assess model adequacy, quantify the inherent complexity of a dataset, and reveal interpretable structure in sequential data.
comment: NeurIPS 2025, 36 pages, 4 figures
☆ Seeing My Future: Predicting Situated Interaction Behavior in Virtual Reality
Virtual and augmented reality systems increasingly demand intelligent adaptation to user behaviors for enhanced interaction experiences. Achieving this requires accurately understanding human intentions and predicting future situated behaviors - such as gaze direction and object interactions - which is vital for creating responsive VR/AR environments and applications like personalized assistants. However, accurate behavioral prediction demands modeling the underlying cognitive processes that drive human-environment interactions. In this work, we introduce a hierarchical, intention-aware framework that models human intentions and predicts detailed situated behaviors by leveraging cognitive mechanisms. Given historical human dynamics and the observation of scene contexts, our framework first identifies potential interaction targets and forecasts fine-grained future behaviors. We propose a dynamic Graph Convolutional Network (GCN) to effectively capture human-environment relationships. Extensive experiments on challenging real-world benchmarks and live VR environment demonstrate the effectiveness of our approach, achieving superior performance across all metrics and enabling practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.
comment: Project Page: https://xy02-05.github.io/Seeing_My_Future/
☆ A Stochastic Differential Equation Framework for Multi-Objective LLM Interactions: Dynamical Systems Analysis with Code Generation Applications NeurIPS 2025
We introduce a general stochastic differential equation framework for modelling multiobjective optimization dynamics in iterative Large Language Model (LLM) interactions. Our framework captures the inherent stochasticity of LLM responses through explicit diffusion terms and reveals systematic interference patterns between competing objectives via an interference matrix formulation. We validate our theoretical framework using iterative code generation as a proof-of-concept application, analyzing 400 sessions across security, efficiency, and functionality objectives. Our results demonstrate strategy-dependent convergence behaviors with rates ranging from 0.33 to 1.29, and predictive accuracy achieving R2 = 0.74 for balanced approaches. This work proposes the feasibility of dynamical systems analysis for multi-objective LLM interactions, with code generation serving as an initial validation domain.
comment: Peer-reviewed and accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) DynaFront 2025 Workshop (https://sites.google.com/view/dynafrontneurips25)
☆ Controllable Generative Trajectory Prediction via Weak Preference Alignment
Deep generative models such as conditional variational autoencoders (CVAEs) have shown great promise for predicting trajectories of surrounding agents in autonomous vehicle planning. State-of-the-art models have achieved remarkable accuracy in such prediction tasks. Besides accuracy, diversity is also crucial for safe planning because human behaviors are inherently uncertain and multimodal. However, existing methods generally lack a scheme to generate controllably diverse trajectories, which is arguably more useful than randomly diversified trajectories, to the end of safe planning. To address this, we propose PrefCVAE, an augmented CVAE framework that uses weakly labeled preference pairs to imbue latent variables with semantic attributes. Using average velocity as an example attribute, we demonstrate that PrefCVAE enables controllable, semantically meaningful predictions without degrading baseline accuracy. Our results show the effectiveness of preference supervision as a cost-effective way to enhance sampling-based generative models.
☆ Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits
We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (\texttt{GLM-ES}) for generalized linear bandits and Neural Ensemble Sampling (\texttt{Neural-ES}) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for \texttt{GLM-ES} and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for \texttt{Neural-ES}, where $d$ is the dimension of feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results of randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove fixed-time horizon assumptions by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate \texttt{GLM-ES}, \texttt{Neural-ES}, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
comment: 40 pages, 1 figure
☆ Deep Signature and Neural RDE Methods for Path-Dependent Portfolio Optimization
We present a deep BSDE and 2BSDE solver that combines truncated log signatures with a neural rough differential equation backbone for high dimensional, path dependent valuation and control. The design aligns stochastic analysis with sequence to path learning, using a CVaR tilted objective to emphasize left tail risk and an optional second order head for risk sensitive control. Under equal compute and parameter budgets, the method improves accuracy, tail fidelity, and training stability across Asian and barrier option pricing and portfolio control tasks. At 200 dimensions, it achieves CVaR(0.99) = 9.8 percent compared to 12.0-13.1 percent for strong baselines, while attaining low HJB residuals and small RMSE for Z and Gamma. Ablations confirm complementary gains from the sequence to path representation and the second order structure. Overall, the results show that combining stochastic analysis with modern deep learning expands the class of solvable path dependent financial models at scale.
comment: Accepted for presentation at the ACM International Conference on AI in Finance (ICAIF 2025), QuantAI Workshop, Singapore. 9 pages. Code available at: https://github.com/AliAtiah/SigRDE
☆ Missing Data Multiple Imputation for Tabular Q-Learning in Online RL
Missing data in online reinforcement learning (RL) poses challenges compared to missing data in standard tabular data or in offline policy learning. The need to impute and act at each time step means that imputation cannot be put off until enough data exist to produce stable imputation models. It also means future data collection and learning depend on previous imputations. This paper proposes fully online imputation ensembles. We find that maintaining multiple imputation pathways may help balance the need to capture uncertainty under missingness and the need for efficiency in online settings. We consider multiple approaches for incorporating these pathways into learning and action selection. Using a Grid World experiment with various types of missingness, we provide preliminary evidence that multiple imputation pathways may be a useful framework for constructing simple and efficient online missing data RL methods.
comment: Working paper
☆ Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance
The generation of trees with a specified tree edit distance has significant applications across various fields, including computational biology, structured data analysis, and image processing. Recently, generative networks have been increasingly employed to synthesize new data that closely resembles the original datasets. However, the appropriate size and depth of generative networks required to generate data with a specified tree edit distance remain unclear. In this paper, we theoretically establish the existence and construction of generative networks capable of producing trees similar to a given tree with respect to the tree edit distance. Specifically, for a given rooted, ordered, and vertex-labeled tree T of size n + 1 with labels from an alphabet \Sigma, and a non-negative integer d, we prove that all rooted, ordered, and vertex-labeled trees over \Sigma with tree edit distance at most d from T can be generated using a ReLU-based generative network with size O(n^3 ) and constant depth. The proposed networks were implemented and evaluated for generating trees with up to 21 nodes. Due to their deterministic architecture, the networks successfully generated all valid trees within the specified tree edit distance. In contrast, state-of-the-art graph generative models GraphRNN and GraphGDP, which rely on non-deterministic mechanisms, produced significantly fewer valid trees, achieving validation rates of only up to 35% and 48%, respectively. These findings provide a theoretical foundation towards construction of compact generative models and open new directions for exact and valid tree-structured data generation. An implementation of the proposed networks is available at https://github.com/MGANN-KU/TreeGen_ReLUNetworks.
☆ Learning-Augmented Streaming Algorithms for Correlation Clustering NeurIPS 2025
We study streaming algorithms for Correlation Clustering. Given a graph as an arbitrary-order stream of edges, with each edge labeled as positive or negative, the goal is to partition the vertices into disjoint clusters, such that the number of disagreements is minimized. In this paper, we give the first learning-augmented streaming algorithms for the problem on both complete and general graphs, improving the best-known space-approximation tradeoffs. Based on the works of Cambus et al. (SODA'24) and Ahn et al. (ICML'15), our algorithms use the predictions of pairwise distances between vertices provided by a predictor. For complete graphs, our algorithm achieves a better-than-$3$ approximation under good prediction quality, while using $\tilde{O}(n)$ total space. For general graphs, our algorithm achieves an $O(\log |E^-|)$ approximation under good prediction quality using $\tilde{O}(n)$ total space, improving the best-known non-learning algorithm in terms of space efficiency. Experimental results on synthetic and real-world datasets demonstrate the superiority of our proposed algorithms over their non-learning counterparts.
comment: NeurIPS 2025
Attention-Enhanced LSTM Modeling for Improved Temperature and Rainfall Forecasting in Bangladesh
Accurate climate forecasting is vital for Bangladesh, a region highly susceptible to climate change impacts on temperature and rainfall. Existing models often struggle to capture long-range dependencies and complex temporal patterns in climate data. This study introduces an advanced Long Short-Term Memory (LSTM) model integrated with an attention mechanism to enhance the prediction of temperature and rainfall dynamics. Utilizing comprehensive datasets from 1901-2023, sourced from NASA's POWER Project for temperature and the Humanitarian Data Exchange for rainfall, the model effectively captures seasonal and long-term trends. It outperforms baseline models, including XGBoost, Simple LSTM, and GRU, achieving a test MSE of 0.2411 (normalized units), MAE of 0.3860 degrees C, R^2 of 0.9834, and NRMSE of 0.0370 for temperature, and MSE of 1283.67 mm^2, MAE of 22.91 mm, R^2 of 0.9639, and NRMSE of 0.0354 for rainfall on monthly forecasts. The model demonstrates improved robustness with only a 20 percent increase in MSE under simulated climate trends (compared to an approximately 2.2-fold increase in baseline models without trend features) and a 50 percent degradation under regional variations (compared to an approximately 4.8-fold increase in baseline models without enhancements). These results highlight the model's ability to improve forecasting precision and offer potential insights into the physical processes governing climate variability in Bangladesh, supporting applications in climate-sensitive sectors.
☆ Mean-square and linear convergence of a stochastic proximal point algorithm in metric spaces of nonpositive curvature
We define a stochastic variant of the proximal point algorithm in the general setting of nonlinear (separable) Hadamard spaces for approximating zeros of the mean of a stochastically perturbed monotone vector field and prove its convergence under a suitable strong monotonicity assumption, together with a probabilistic independence assumption and a separability assumption on the tangent spaces. As a particular case, our results transfer previous work by P. Bianchi on that method in Hilbert spaces for the first time to Hadamard manifolds. Moreover, our convergence proof is fully effective and allows for the construction of explicit rates of convergence for the iteration towards the (unique) solution both in mean and almost surely. These rates are moreover highly uniform, being independent of most data surrounding the iteration, space or distribution. In that generality, these rates are novel already in the context of Hilbert spaces. Linear nonasymptotic guarantees under additional second-moment conditions on the Yosida approximates and special cases of stochastic convex minimization are discussed.
comment: 24 pages
☆ Stock Prediction via a Dual Relation Fusion Network incorporating Static and Dynamic Relations
Accurate modeling of inter-stock relationships is critical for stock price forecasting. However, existing methods predominantly focus on single-state relationships, neglecting the essential complementarity between dynamic and static inter-stock relations. To solve this problem, we propose a Dual Relation Fusion Network (DRFN) to capture the long-term relative stability of stock relation structures while retaining the flexibility to respond to sudden market shifts. Our approach features a novel relative static relation component that models time-varying long-term patterns and incorporates overnight informational influences. We capture dynamic inter-stock relationships through distance-aware mechanisms, while evolving long-term structures via recurrent fusion of dynamic relations from the prior day with the pre-defined static relations. Experiments demonstrate that our method significantly outperforms the baselines across different markets, with high sensitivity to the co-movement of relational strength and stock price.
comment: 11 pages
☆ Digital Twin-enabled Multi-generation Control Co-Design with Deep Reinforcement Learning
Control Co-Design (CCD) integrates physical and control system design to improve the performance of dynamic and autonomous systems. Despite advances in uncertainty-aware CCD methods, real-world uncertainties remain highly unpredictable. Multi-generation design addresses this challenge by considering the full lifecycle of a product: data collected from each generation informs the design of subsequent generations, enabling progressive improvements in robustness and efficiency. Digital Twin (DT) technology further strengthens this paradigm by creating virtual representations that evolve over the lifecycle through real-time sensing, model updating, and adaptive re-optimization. This paper presents a DT-enabled CCD framework that integrates Deep Reinforcement Learning (DRL) to jointly optimize physical design and controller. DRL accelerates real-time decision-making by allowing controllers to continuously learn from data and adapt to uncertain environments. Extending this approach, the framework employs a multi-generation paradigm, where each cycle of deployment, operation, and redesign uses collected data to refine DT models, improve uncertainty quantification through quantile regression, and inform next-generation designs of both physical components and controllers. The framework is demonstrated on an active suspension system, where DT-enabled learning from road conditions and driving behaviors yields smoother and more stable control trajectories. Results show that the method significantly enhances dynamic performance, robustness, and efficiency. Contributions of this work include: (1) extending CCD into a lifecycle-oriented multi-generation framework, (2) leveraging DTs for continuous model updating and informed design, and (3) employing DRL to accelerate adaptive real-time decision-making.
comment: to be published in Journal of Mechanical Design
☆ High-Dimensional Learning Dynamics of Quantized Models with Straight-Through Estimator
Quantized neural network training optimizes a discrete, non-differentiable objective. The straight-through estimator (STE) enables backpropagation through surrogate gradients and is widely used. While previous studies have primarily focused on the properties of surrogate gradients and their convergence, the influence of quantization hyperparameters, such as bit width and quantization range, on learning dynamics remains largely unexplored. We theoretically show that in the high-dimensional limit, STE dynamics converge to a deterministic ordinary differential equation. This reveals that STE training exhibits a plateau followed by a sharp drop in generalization error, with plateau length depending on the quantization range. A fixed-point analysis quantifies the asymptotic deviation from the unquantized linear model. We also extend analytical techniques for stochastic gradient descent to nonlinear transformations of weights and inputs.
comment: 27 pages, 14 figures
☆ Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits NeurIPS 2025
Heavy-tailed noise is pervasive in modern machine learning applications, arising from data heterogeneity, outliers, and non-stationary stochastic environments. While second-order methods can significantly accelerate convergence in light-tailed or bounded-noise settings, such algorithms are often brittle and lack guarantees under heavy-tailed noise -- precisely the regimes where robustness is most critical. In this work, we take a first step toward a theoretical understanding of second-order optimization under heavy-tailed noise. We consider a setting where stochastic gradients and Hessians have only bounded $p$-th moments, for some $p\in (1,2]$, and establish tight lower bounds on the sample complexity of any second-order method. We then develop a variant of normalized stochastic gradient descent that leverages second-order information and provably matches these lower bounds. To address the instability caused by large deviations, we introduce a novel algorithm based on gradient and Hessian clipping, and prove high-probability upper bounds that nearly match the fundamental limits. Our results provide the first comprehensive sample complexity characterization for second-order optimization under heavy-tailed noise. This positions Hessian clipping as a robust and theoretically sound strategy for second-order algorithm design in heavy-tailed regimes.
comment: Accepted for publication at NeurIPS 2025
Interactive Atmospheric Composition Emulation for Next-Generation Earth System Models
Interactive composition simulations in Earth System Models (ESMs) are computationally expensive as they transport numerous gaseous and aerosol tracers at each timestep. This limits higher-resolution transient climate simulations with current computational resources. ESMs like NASA GISS-ModelE3 (ModelE) often use pre-computed monthly-averaged atmospheric composition concentrations (Non-Interactive Tracers or NINT) to reduce computational costs. While NINT significantly cuts computations, it fails to capture real-time feedback between aerosols and other climate processes by relying on pre-calculated fields. We extended the ModelE NINT version using machine learning (ML) to create Smart NINT, which emulates interactive emissions. Smart NINT interactively calculates concentrations using ML with surface emissions and meteorological data as inputs, avoiding full physics parameterizations. Our approach utilizes a spatiotemporal architecture that possesses a well-matched inductive bias to effectively capture the spatial and temporal dependencies in tracer evolution. Input data processed through the first 20 vertical levels (from the surface up to 656 hPa) using the ModelE OMA scheme. This vertical range covers nearly the entire BCB concentration distribution in the troposphere, where significant variation on short time horizons due to surface-level emissions is observed. Our evaluation shows excellent model performance with R-squared values of 0.92 and Pearson-r of 0.96 at the first pressure level. This high performance continues through level 15 (808.5 hPa), then gradually decreases as BCB concentrations drop significantly. The model maintains acceptable performance even when tested on data from entirely different periods outside the training domain, which is a crucial capability for climate modeling applications requiring reliable long-term projections.
☆ Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers
Retrosynthesis is one of the domains transformed by the rise of generative models, and it is one where the problem of nonsensical or erroneous outputs (hallucinations) is particularly insidious: reliable assessment of synthetic plans is time-consuming, with automatic methods lacking. In this work, we present RetroTrim, a retrosynthesis system that successfully avoids nonsensical plans on a set of challenging drug-like targets. Compared to common baselines in the field, our system is not only the sole method that succeeds in filtering out hallucinated reactions, but it also results in the highest number of high-quality paths overall. The key insight behind RetroTrim is the combination of diverse reaction scoring strategies, based on machine learning models and existing chemical databases. We show that our scoring strategies capture different classes of hallucinations by analyzing them on a dataset of labeled retrosynthetic intermediates. To measure the performance of retrosynthesis systems, we propose a novel evaluation protocol for reactions and synthetic paths based on a structured review by expert chemists. Using this protocol, we compare systems on a set of 32 novel targets, curated to reflect recent trends in drug structures. While the insights behind our methodology are broadly applicable to retrosynthesis, our focus is on targets in the drug-like domain. By releasing our benchmark targets and the details of our evaluation protocol, we hope to inspire further research into reliable retrosynthesis.
☆ Automatic Piecewise Linear Regression for Predicting Student Learning Satisfaction
Although student learning satisfaction has been widely studied, modern techniques such as interpretable machine learning and neural networks have not been sufficiently explored. This study demonstrates that a recent model that combines boosting with interpretability, automatic piecewise linear regression(APLR), offers the best fit for predicting learning satisfaction among several state-of-the-art approaches. Through the analysis of APLR's numerical and visual interpretations, students' time management and concentration abilities, perceived helpfulness to classmates, and participation in offline courses have the most significant positive impact on learning satisfaction. Surprisingly, involvement in creative activities did not positively affect learning satisfaction. Moreover, the contributing factors can be interpreted on an individual level, allowing educators to customize instructions according to student profiles.
☆ ProteinAE: Protein Diffusion Autoencoders for Structure Encoding
Developing effective representations of protein structures is essential for advancing protein science, particularly for protein generative modeling. Current approaches often grapple with the complexities of the SE(3) manifold, rely on discrete tokenization, or the need for multiple training objectives, all of which can hinder the model optimization and generalization. We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder designed to overcome these challenges by directly mapping protein backbone coordinates from E(3) into a continuous, compact latent space. ProteinAE employs a non-equivariant Diffusion Transformer with a bottleneck design for efficient compression and is trained end-to-end with a single flow matching objective, substantially simplifying the optimization pipeline. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders. The resulting latent space serves as a powerful foundation for a latent diffusion model that bypasses the need for explicit equivariance. This enables efficient, high-quality structure generation that is competitive with leading structure-based approaches and significantly outperforms prior latent-based methods. Code is available at https://github.com/OnlyLoveKFC/ProteinAE_v1.
☆ GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus
Linear attention mechanisms have emerged as efficient alternatives to full self-attention in Graph Transformers, offering linear time complexity. However, existing linear attention models often suffer from a significant drop in expressiveness due to low-rank projection structures and overly uniform attention distributions. We theoretically prove that these properties reduce the class separability of node representations, limiting the model's classification ability. To address this, we propose a novel hybrid framework that enhances both the rank and focus of attention. Specifically, we enhance linear attention by attaching a gated local graph network branch to the value matrix, thereby increasing the rank of the resulting attention map. Furthermore, to alleviate the excessive smoothing effect inherent in linear attention, we introduce a learnable log-power function into the attention scores to reduce entropy and sharpen focus. We theoretically show that this function decreases entropy in the attention distribution, enhancing the separability of learned embeddings. Extensive experiments on both homophilic and heterophilic graph benchmarks demonstrate that our method achieves competitive performance while preserving the scalability of linear attention.
☆ ImpMIA: Leveraging Implicit Bias for Membership Inference Attack under Realistic Scenarios
Determining which data samples were used to train a model-known as Membership Inference Attack (MIA)-is a well-studied and important problem with implications for data privacy. Black-box methods presume access only to the model's outputs and often rely on training auxiliary reference models. While they have shown strong empirical performance, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. In this paper, we demonstrate that removing these assumptions leads to a significant drop in the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks, hence removes the need to rely on any reference models and their assumptions. ImpMIA is a white-box attack -- a setting which assumes access to model weights and is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). Building on maximum-margin implicit bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples. This is done by finding the samples whose gradients most strongly reconstruct the trained model's parameters. As a result, ImpMIA achieves state-of-the-art performance compared to both black and white box attacks in realistic settings where only the model weights and a superset of the training data are available.
☆ SDG-L: A Semiparametric Deep Gaussian Process based Framework for Battery Capacity Prediction
Lithium-ion batteries are becoming increasingly omnipresent in energy supply. However, the durability of energy storage using lithium-ion batteries is threatened by their dropping capacity with the growing number of charging/discharging cycles. An accurate capacity prediction is the key to ensure system efficiency and reliability, where the exploitation of battery state information in each cycle has been largely undervalued. In this paper, we propose a semiparametric deep Gaussian process regression framework named SDG-L to give predictions based on the modeling of time series battery state data. By introducing an LSTM feature extractor, the SDG-L is specially designed to better utilize the auxiliary profiling information during charging/discharging process. In experimental studies based on NASA dataset, our proposed method obtains an average test MSE error of 1.2%. We also show that SDG-L achieves better performance compared to existing works and validate the framework using ablation studies.
☆ DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
Context parallelism has emerged as a key technique to support long-context training, a growing trend in generative AI for modern large models. However, existing context parallel methods rely on static parallelization configurations that overlook the dynamic nature of training data, specifically, the variability in sequence lengths and token relationships (i.e., attention patterns) across samples. As a result, these methods often suffer from unnecessary communication overhead and imbalanced computation. In this paper, we present DCP, a dynamic context parallel training framework that introduces fine-grained blockwise partitioning of both data and computation. By enabling flexible mapping of data and computation blocks to devices, DCP can adapt to varying sequence characteristics, effectively reducing communication and improving memory and computation balance. Micro-benchmarks demonstrate that DCP accelerates attention by 1.19x~2.45x under causal masks and 2.15x~3.77x under sparse attention patterns. Additionally, we observe up to 0.94x~1.16x end-to-end training speed-up for causal masks, and 1.00x~1.46x for sparse masks.
comment: 16 pages, 22 figures
☆ Encoder Decoder Generative Adversarial Network Model for Stock Market Prediction
Forecasting stock prices remains challenging due to the volatile and non-linear nature of financial markets. Despite the promise of deep learning, issues such as mode collapse, unstable training, and difficulty in capturing temporal and feature level correlations have limited the applications of GANs in this domain. We propose a GRU-based Encoder-Decoder GAN (EDGAN) model that strikes a balance between expressive power and simplicity. The model introduces key innovations such as a temporal decoder with residual connections for precise reconstruction, conditioning on static and dynamic covariates for contextual learning, and a windowing mechanism to capture temporal dynamics. Here, the generator uses a dense encoder-decoder framework with residual GRU blocks. Extensive experiments on diverse stock datasets demonstrate that EDGAN achieves superior forecasting accuracy and training stability, even in volatile markets. It consistently outperforms traditional GAN variants in forecasting accuracy and convergence stability under market conditions.
☆ Budget Allocation for Unknown Value Functions in a Lipschitz Space
Building learning models frequently requires evaluating numerous intermediate models. Examples include models considered during feature selection, model structure search, and parameter tunings. The evaluation of an intermediate model influences subsequent model exploration decisions. Although prior knowledge can provide initial quality estimates, true performance is only revealed after evaluation. In this work, we address the challenge of optimally allocating a bounded budget to explore the space of intermediate models. We formalize this as a general budget allocation problem over unknown-value functions within a Lipschitz space.
☆ FusionGen: Feature Fusion-Based Few-Shot EEG Data Generation
Brain-computer interfaces (BCIs) provide potential for applications ranging from medical rehabilitation to cognitive state assessment by establishing direct communication pathways between the brain and external devices via electroencephalography (EEG). However, EEG-based BCIs are severely constrained by data scarcity and significant inter-subject variability, which hinder the generalization and applicability of EEG decoding models in practical settings. To address these challenges, we propose FusionGen, a novel EEG data generation framework based on disentangled representation learning and feature fusion. By integrating features across trials through a feature matching fusion module and combining them with a lightweight feature extraction and reconstruction pipeline, FusionGen ensures both data diversity and trainability under limited data constraints. Extensive experiments on multiple publicly available EEG datasets demonstrate that FusionGen significantly outperforms existing augmentation techniques, yielding notable improvements in classification accuracy.
☆ Compositional Symmetry as Compression: Lie Pseudogroup Structure in Algorithmic Agents
In the algorithmic (Kolmogorov) view, agents are programs that track and compress sensory streams using generative programs. We propose a framework where the relevant structural prior is simplicity (Solomonoff) understood as \emph{compositional symmetry}: natural streams are well described by (local) actions of finite-parameter Lie pseudogroups on geometrically and topologically complex low-dimensional configuration manifolds (latent spaces). Modeling the agent as a generic neural dynamical system coupled to such streams, we show that accurate world-tracking imposes (i) \emph{structural constraints} -- equivariance of the agent's constitutive equations and readouts -- and (ii) \emph{dynamical constraints}: under static inputs, symmetry induces conserved quantities (Noether-style labels) in the agent dynamics and confines trajectories to reduced invariant manifolds; under slow drift, these manifolds move but remain low-dimensional. This yields a hierarchy of reduced manifolds aligned with the compositional factorization of the pseudogroup, providing a geometric account of the ``blessing of compositionality'' in deep models. We connect these ideas to the Spencer formalism for Lie pseudogroups and formulate a symmetry-based, self-contained version of predictive coding in which higher layers receive only \emph{coarse-grained residual transformations} (prediction-error coordinates) along symmetry directions unresolved at lower layers.
comment: Submitted to NeurReps 2025 (https://www.neurreps.org)
☆ Deep semi-supervised approach based on consistency regularization and similarity learning for weeds classification
Weed species classification represents an important step for the development of automated targeting systems that allow the adoption of precision agriculture practices. To reduce costs and yield losses caused by their presence. The identification of weeds is a challenging problem due to their shared similarities with crop plants and the variability related to the differences in terms of their types. Along with the variations in relation to changes in field conditions. Moreover, to fully benefit from deep learning-based methods, large fully annotated datasets are needed. This requires time intensive and laborious process for data labeling, which represents a limitation in agricultural applications. Hence, for the aim of improving the utilization of the unlabeled data, regarding conditions of scarcity in terms of the labeled data available during the learning phase and provide robust and high classification performance. We propose a deep semi-supervised approach, that combines consistency regularization with similarity learning. Through our developed deep auto-encoder architecture, experiments realized on the DeepWeeds dataset and inference in noisy conditions demonstrated the effectiveness and robustness of our method in comparison to state-of-the-art fully supervised deep learning models. Furthermore, we carried out ablation studies for an extended analysis of our proposed joint learning strategy.
comment: Submitted to EURASIP Journal on Image and Video Processing
☆ Understanding Self-supervised Contrastive Learning through Supervised Objectives
Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.
comment: Accepted at TMLR 2025
☆ Multitask Learning with Learned Task Relationships
Classical consensus-based strategies for federated and decentralized learning are statistically suboptimal in the presence of heterogeneous local data or task distributions. As a result, in recent years, there has been growing interest in multitask or personalized strategies, which allow individual agents to benefit from one another in pursuing locally optimal models without enforcing consensus. Existing strategies require either precise prior knowledge of the underlying task relationships or are fully non-parametric and instead rely on meta-learning or proximal constructions. In this work, we introduce an algorithmic framework that strikes a balance between these extremes. By modeling task relationships through a Gaussian Markov Random Field with an unknown precision matrix, we develop a strategy that jointly learns both the task relationships and the local models, allowing agents to self-organize in a way consistent with their individual data distributions. Our theoretical analysis quantifies the quality of the learned relationship, and our numerical experiments demonstrate its practical effectiveness.
♻ ☆ BALF: Budgeted Activation-Aware Low-Rank Factorization for Fine-Tuning-Free Model Compression
Neural network compression techniques typically require expensive fine-tuning or search procedures, rendering them impractical on commodity hardware. Inspired by recent LLM compression research, we present a general activation-aware factorization framework that can be applied to a broad range of layers. Moreover, we introduce a scalable budgeted rank allocator that allows flexible control over compression targets (e.g., retaining 50% of parameters) with no overhead. Together, these components form BALF, an efficient pipeline for compressing models without fine-tuning. We demonstrate its effectiveness across multiple scales and architectures, from ResNet-20 on CIFAR-10 to ResNeXt-101 and vision transformers on ImageNet, and show that it achieves excellent results in the fine-tuning-free regime. For instance, BALF reduces FLOPs on ResNeXt-101 by 45% with only a 1-percentage-point top-1 accuracy drop.
♻ ☆ Noise Injection Systemically Degrades Large Language Model Safety Guardrails
Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet, their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) that deeper safety fine-tuning affords no extra protection, and (3) that chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight the potential of reasoning-based and reinforcement learning approaches as promising direction for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical applications as these results imply that widely-deployed safety tuning methods can fail even without adversarial prompts.
comment: 9 pages,3 figures
♻ ☆ GradMetaNet: An Equivariant Architecture for Learning on Gradients
Gradients of neural networks encode valuable information for optimization, editing, and analysis of models. Therefore, practitioners often treat gradients as inputs to task-specific algorithms, e.g. for pruning or optimization. Recent works explore learning algorithms that operate directly on gradients but use architectures that are not specifically designed for gradient processing, limiting their applicability. In this paper, we present a principled approach for designing architectures that process gradients. Our approach is guided by three principles: (1) equivariant design that preserves neuron permutation symmetries, (2) processing sets of gradients across multiple data points to capture curvature information, and (3) efficient gradient representation through rank-1 decomposition. Based on these principles, we introduce GradMetaNet, a novel architecture for learning on gradients, constructed from simple equivariant blocks. We prove universality results for GradMetaNet, and show that previous approaches cannot approximate natural gradient-based functions that GradMetaNet can. We then demonstrate GradMetaNet's effectiveness on a diverse set of gradient-based tasks on MLPs and transformers, such as learned optimization, INR editing, and estimating loss landscape curvature.
♻ ☆ Methods to improve run time of hydrologic models: opportunities and challenges in the machine learning era
The application of Machine Learning (ML) to hydrologic modeling is fledgling. Its applicability to capture the dependencies on watersheds to forecast better within a short period is fascinating. One of the key reasons to adopt ML algorithms over physics-based models is its computational efficiency advantage and flexibility to work with various data sets. The diverse applications, particularly in emergency response and expanding over a large scale, demand the hydrological model in a short time and make researchers adopt data-driven modeling approaches unhesitatingly. In this work, in the era of ML and deep learning (DL), how it can help to improve the overall run time of physics-based model and potential constraints that should be addressed while modeling. This paper covers the opportunities and challenges of adopting ML for hydrological modeling and subsequently how it can help to improve the simulation time of physics-based models and future works that should be addressed.
♻ ☆ VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills NeurIPS 2025
In self-supervised reinforcement learning (RL), one of the key challenges is learning a diverse set of skills to prepare agents for unknown future tasks. Despite impressive advances, scalability and evaluation remain prevalent issues. Regarding scalability, the search for meaningful skills can be obscured by high-dimensional feature spaces, where relevant features may vary across downstream task domains. For evaluating skill diversity, defining what constitutes "diversity" typically requires a hard commitment to a specific notion of what it means for skills to be diverse, potentially leading to inconsistencies in how skill diversity is understood, making results across different approaches hard to compare, and leaving many forms of diversity unexplored. To address these issues, we adopt a measure of sample diversity that translates ideas from ecology to machine learning -- the Vendi Score -- allowing the user to specify and evaluate any desired form of diversity. We demonstrate how this metric facilitates skill evaluation and introduce VendiRL, a unified framework for learning diversely diverse sets of skills. Given distinct similarity functions, VendiRL motivates distinct forms of diversity, which could support skill-diversity pretraining in new and richly interactive environments where optimising for various forms of diversity may be desirable.
comment: 17 pages including appendices, full paper at the Scaling Environments for Agents workshop at NeurIPS 2025
♻ ☆ Continual Learning for Adaptive AI Systems
Continual learning the ability of a neural network to learn multiple sequential tasks without catastrophic forgetting remains a central challenge in developing adaptive artificial intelligence systems. While deep learning models achieve state-of-the-art performance across domains, they remain limited by overfitting and forgetting. This paper introduces Cluster-Aware Replay (CAR), a hybrid continual learning framework that integrates a small, class-balanced replay buffer with a regularization term based on Inter-Cluster Fitness (ICF) in the feature space. The ICF loss penalizes overlapping feature representations between new and previously learned tasks, encouraging geometric separation in the latent space and reducing interference. Using the standard five-task Split CIFAR-10 benchmark with a ResNet-18 backbone, initial experiments demonstrate that CAR better preserves earlier task performance compared to fine-tuning alone. These findings are preliminary but highlight feature-space regularization as a promising direction for mitigating catastrophic forgetting.
comment: Version 2: Revised abstract and figures. Updated terminology (ICF). Preliminary results
♻ ☆ On a Geometry of Interbrain Networks NeurIPS 2025
Effective analysis in neuroscience benefits significantly from robust conceptual frameworks. Traditional metrics of interbrain synchrony in social neuroscience typically depend on fixed, correlation-based approaches, restricting their explanatory capacity to descriptive observations. Inspired by the successful integration of geometric insights in network science, we propose leveraging discrete geometry to examine the dynamic reconfigurations in neural interactions during social exchanges. Unlike conventional synchrony approaches, our method interprets inter-brain connectivity changes through the evolving geometric structures of neural networks. This geometric framework is realized through a pipeline that identifies critical transitions in network connectivity using entropy metrics derived from curvature distributions. By doing so, we significantly enhance the capacity of hyperscanning methodologies to uncover underlying neural mechanisms in interactive social behavior.
comment: 4 pages, 1 figure, accepted at NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations
♻ ☆ LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement ICCV 2025
Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning, yet full parameter fine-tuning is often computationally prohibitive for large models. Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters. While LoRA allows for efficient fine-tuning, it requires significant data for adaptation, making Federated Learning (FL) an appealing solution due to its privacy-preserving collaborative framework. However, combining LoRA with FL introduces two key challenges: the \textbf{Server-Side Aggregation Bias}, where server-side averaging of LoRA matrices diverges from the ideal global update, and the \textbf{Client-Side Initialization Lag}, emphasizing the need for consistent initialization across rounds. Existing approaches address these challenges individually, limiting their effectiveness. We propose LoRA-FAIR, a novel method that tackles both issues by introducing a correction term on the server, enhancing aggregation efficiency and accuracy. LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods. Experimental results on ViT and MLP-Mixer models across large-scale datasets demonstrate that LoRA-FAIR consistently achieves performance improvements in FL settings.
comment: ICCV 2025, Code is available: https://github.com/jmbian/LoRA-FAIR
♻ ☆ MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
With increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). The key idea of MLorc is to compress and reconstruct the momentum of matrix parameters during training to reduce memory consumption. Compared to LoRA, MLorc avoids enforcing a fixed-rank constraint on weight update matrices and thus enables full-parameter learning. Compared to GaLore, MLorc directly compress the momentum rather than gradients, thereby better preserving the training dynamics of full-parameter fine-tuning. We provide a theoretical guarantee for its convergence under mild assumptions. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning at small ranks (e.g., $r=4$), and generalizes well across different optimizers -- all while not compromising time or memory efficiency.
♻ ☆ AI-powered skin spectral imaging enables instant sepsis diagnosis and outcome prediction in critically ill patients
With sepsis remaining a leading cause of mortality, early identification of patients with sepsis and those at high risk of death is a challenge of high socioeconomic importance. Given the potential of hyperspectral imaging (HSI) to monitor microcirculatory alterations, we propose a deep learning approach to automated sepsis diagnosis and mortality prediction using a single HSI cube acquired within seconds. In a prospective observational study, we collected HSI data from the palms and fingers of more than 480 intensive care unit patients. Neural networks applied to HSI measurements predicted sepsis and mortality with areas under the receiver operating characteristic curve (AUROCs) of 0.80 and 0.72, respectively. Performance improved substantially with additional clinical data, reaching AUROCs of 0.94 for sepsis and 0.83 for mortality. We conclude that deep learning-based HSI analysis enables rapid and noninvasive prediction of sepsis and mortality, with a potential clinical value for enhancing diagnosis and treatment.
comment: Markus A. Weigand, Lena Maier-Hein and Maximilian Dietrich contributed equally
♻ ☆ Understanding and Improving Length Generalization in Recurrent Models
Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths-i.e. they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the unexplored states hypothesis, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all attainable states (i.e. states that would be attained if the recurrence was applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1\%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k\longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
♻ ☆ State-Covering Trajectory Stitching for Diffusion Planners NeurIPS 2025
Diffusion-based generative models are emerging as powerful tools for long-horizon planning in reinforcement learning (RL), particularly with offline datasets. However, their performance is fundamentally limited by the quality and diversity of training data. This often restricts their generalization to tasks outside their training distribution or longer planning horizons. To overcome this challenge, we propose State-Covering Trajectory Stitching (SCoTS), a novel reward-free trajectory augmentation method that incrementally stitches together short trajectory segments, systematically generating diverse and extended trajectories. SCoTS first learns a temporal distance-preserving latent representation that captures the underlying temporal structure of the environment, then iteratively stitches trajectory segments guided by directional exploration and novelty to effectively cover and expand this latent space. We demonstrate that SCoTS significantly improves the performance and generalization capabilities of diffusion planners on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning. Furthermore, augmented trajectories generated by SCoTS significantly improve the performance of widely used offline goal-conditioned RL algorithms across diverse environments.
comment: Accepted to NeurIPS 2025
♻ ☆ Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection
The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods rely solely on softmax probabilities, overlooking the rich discriminative potential of the feature embeddings learned by VLMs trained on millions of samples. To address this limitation, we propose a novel context optimization (CoOp)-based framework that integrates subspace representation learning with prompt tuning. Our approach improves ID-OOD separability by projecting the ID features into a subspace spanned by prompt vectors, while projecting ID-irrelevant features into an orthogonal null space. To train such OOD detection framework, we design an easy-to-handle end-to-end learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy. Experiments on real-world datasets showcase the effectiveness of our approach.
♻ ☆ Graph Your Own Prompt NeurIPS 2025
We propose Graph Consistency Regularization (GCR), a novel framework that injects relational graph structures, derived from model predictions, into the learning process to promote class-aware, semantically meaningful feature representations. Functioning as a form of self-prompting, GCR enables the model to refine its internal structure using its own outputs. While deep networks learn rich representations, these often capture noisy inter-class similarities that contradict the model's predicted semantics. GCR addresses this issue by introducing parameter-free Graph Consistency Layers (GCLs) at arbitrary depths. Each GCL builds a batch-level feature similarity graph and aligns it with a global, class-aware masked prediction graph, derived by modulating softmax prediction similarities with intra-class indicators. This alignment enforces that feature-level relationships reflect class-consistent prediction behavior, acting as a semantic regularizer throughout the network. Unlike prior work, GCR introduces a multi-layer, cross-space graph alignment mechanism with adaptive weighting, where layer importance is learned from graph discrepancy magnitudes. This allows the model to prioritize semantically reliable layers and suppress noisy ones, enhancing feature quality without modifying the architecture or training procedure. GCR is model-agnostic, lightweight, and improves semantic structure across various networks and datasets. Experiments show that GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization, offering a new perspective on learning from prediction structure. [Project website](https://darcyddx.github.io/gcr/) [Code](https://github.com/Darcyddx/graph-prompt)
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.
♻ ☆ A Clinical-grade Universal Foundation Model for Intraoperative Pathology
Intraoperative pathology is pivotal to precision surgery, yet its clinical impact is constrained by diagnostic complexity and the limited availability of high-quality frozen-section data. While computational pathology has made significant strides, the lack of large-scale, prospective validation has impeded its routine adoption in surgical workflows. Here, we introduce CRISP, a clinical-grade foundation model developed on over 100,000 frozen sections from eight medical centers, specifically designed to provide Clinical-grade Robust Intraoperative Support for Pathology (CRISP). CRISP was comprehensively evaluated on more than 15,000 intraoperative slides across nearly 100 retrospective diagnostic tasks, including benign-malignant discrimination, key intraoperative decision-making, and pan-cancer detection, etc. The model demonstrated robust generalization across diverse institutions, tumor types, and anatomical sites-including previously unseen sites and rare cancers. In a prospective cohort of over 2,000 patients, CRISP sustained high diagnostic accuracy under real-world conditions, directly informing surgical decisions in 92.6% of cases. Human-AI collaboration further reduced diagnostic workload by 35%, avoided 105 ancillary tests and enhanced detection of micrometastases with 87.5% accuracy. Together, these findings position CRISP as a clinical-grade paradigm for AI-driven intraoperative pathology, bridging computational advances with surgical precision and accelerating the translation of artificial intelligence into routine clinical practice.
♻ ☆ SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges insufficiently addressed by techniques developed for dense models. In this work, the MoE-specific safety risk of positional vulnerability-that safety-aligned behaviors rely on specific expert modules-is formalized and systematically analyzed. An analytical framework, SAFEx, is presented to robustly identify, characterize, and validate safety-critical experts via a stability-based expert selection procedure, and to decompose them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in identifying and recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in controlling and enforcing model behaviors to generate appropriate safety responses. Expert-level interventions are conducted to probe causality and to test mitigation. Targeted masking of SAFEx-selected experts reveals that safety behavior is highly concentrated. On Qwen3-30B-A3B, configured with 48 MoE-FFN layers and 128 experts per layer under top-8 routing (48x128=6,144 experts in total), disabling 12 selected experts reduces the refusal rate by 22%. In addition, lightweight adaptation is performed using LoRA under three configurations-the HRCG, the union of HCDG and HRCG, and all experts-and the resulting updates are composed through negative weight merging targeted at the HRCG, leading to improved refusal under adversarial prompts without full-model retraining. These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures.
comment: 10 pages, 8 figures
♻ ☆ PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling
We propose PDE-Transformer, a novel sequence modeling paradigm that casts the forward pass of a Transformer as the numerical discretization of a continuous reaction-diffusion system derived from a variational energy functional. In our framework, token embeddings evolve under a partial differential equation whose nonlocal integral term models self-attention, local reaction term models feed-forward layers, diffusion term encodes positional smoothing, and a stability control term corresponds to layer normalization. From this unifying perspective, we design an Adaptive PDE Diffusion Layer-an efficient, learnable finite-difference stencil that enforces local smoothness in feature space with linear time complexity and complements self-attention's global routing. Through a systematic theoretical analysis based on four pillars:stability, diffusion geometry, multi-scale dynamics, and component coupling, we derive principled guidelines for integrating the PDE layer at seven candidate points in the Transformer. Empirically, on the Long Range Arena benchmark, placing the layer immediately after embedding yields a 4.1 pp average accuracy gain over a strong baseline, and an adaptive multi-scale variant delivers further improvements. Our work thus offers a principled, lightweight mechanism to bolster long-range dependency modeling by harmonizing continuous PDE smoothing with discrete self-attention.
♻ ☆ Logistic-Gated Operators Enable Auditable Unit-Aware Thresholds in Symbolic Regression
Symbolic regression promises readable equations but struggles to encode unit-aware thresholds and conditional logic. We propose logistic-gated operators (LGO) -- differentiable gates with learnable location and steepness -- embedded as typed primitives and mapped back to physical units for audit. Across two primary health datasets (ICU, NHANES), the hard-gate variant recovers clinically plausible cut-points: 71% (5/7) of assessed thresholds fall within 10% of guideline anchors and 100% within 20%, while using far fewer gates than the soft variant (ICU median 4.0 vs 10.0; NHANES 5.0 vs 12.5), and remaining within the competitive accuracy envelope of strong SR baselines. On predominantly smooth tasks, gates are pruned, preserving parsimony. The result is compact symbolic equations with explicit, unit-aware thresholds that can be audited against clinical anchors -- turning interpretability from a post-hoc explanation into a modeling constraint and equipping symbolic regression with a practical calculus for regime switching and governance-ready deployment.
♻ ☆ MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.
comment: Accepted to EMNLP 2025 Main as an oral presentation. Website: https://eth-lre.github.io/mathtutorbench
♻ ☆ Local Learning for Covariate Selection in Nonparametric Causal Effect Estimation with Latent Variables
Estimating causal effects from nonexperimental data is a fundamental problem in many fields of science. A key component of this task is selecting an appropriate set of covariates for confounding adjustment to avoid bias. Most existing methods for covariate selection often assume the absence of latent variables and rely on learning the global network structure among variables. However, identifying the global structure can be unnecessary and inefficient, especially when our primary interest lies in estimating the effect of a treatment variable on an outcome variable. To address this limitation, we propose a novel local learning approach for covariate selection in nonparametric causal effect estimation, which accounts for the presence of latent variables. Our approach leverages testable independence and dependence relationships among observed variables to identify a valid adjustment set for a target causal relationship, ensuring both soundness and completeness under standard assumptions. We validate the effectiveness of our algorithm through extensive experiments on both synthetic and real-world data.
♻ ☆ Clustering by Nonparametric Smoothing
A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/CNS
♻ ☆ Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles ICLR 2026
Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. The optimization landscape of these functions is often highly complex and textured, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. Complicating things further is the fact that these landscape characteristics are a function of the data, meaning that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are, therefore, vulnerable to being derailed by noisy data. In practice,this translates to a strong dependence of the optimization dynamics on the noise in the data, i.e., poor generalization performance. To remediate this problem, we propose a new optimization procedure: Rolling Ball Optimizer (RBO), that breaks this spatial locality by incorporating information from a larger region of the loss landscape in its updates. We achieve this by simulating the motion of a rigid sphere of finite radius rolling on the loss landscape, a straightforward generalization of Gradient Descent (GD) that simplifies into it in the infinitesimal limit. The radius serves as a hyperparameter that determines the scale at which RBO sees the loss landscape, allowing control over the granularity of its interaction therewith. We are motivated by the intuition that the large-scale geometry of the loss landscape is less data-specific than its fine-grained structure, and that it is easier to optimize. We support this intuition by proving that our algorithm has a smoothing effect on the loss function. Evaluation against SGD, SAM, and Entropy-SGD, on MNIST and CIFAR-10/100 demonstrates promising results in terms of convergence speed, training accuracy, and generalization performance.
comment: Submitted for review to ICLR 2026
♻ ☆ Explainable Deep Neural Network for Multimodal ECG Signals: Intermediate vs Late Fusion
The limitations of unimodal deep learning models, particularly their tendency to overfit and limited generalizability, have renewed interest in multimodal fusion strategies. Multimodal deep neural networks (MDNN) have the capability of integrating diverse data domains and offer a promising solution for robust and accurate predictions. However, the optimal fusion strategy, intermediate fusion (feature-level) versus late fusion (decision-level) remains insufficiently examined, especially in high-stakes clinical contexts such as ECG-based cardiovascular disease (CVD) classification. This study investigates the comparative effectiveness of intermediate and late fusion strategies using ECG signals across three domains: time, frequency, and time-frequency. A series of experiments were conducted to identify the highest-performing fusion architecture. Results demonstrate that intermediate fusion consistently outperformed late fusion, achieving a peak accuracy of 97 percent, with Cohen's d > 0.8 relative to standalone models and d = 0.40 compared to late fusion. Interpretability analyses using saliency maps reveal that both models align with the discretized ECG signals. Statistical dependency between the discretized ECG signals and corresponding saliency maps for each class was confirmed using Mutual Information (MI). The proposed ECG domain-based multimodal model offers superior predictive capability and enhanced explainability, crucial attributes in medical AI applications, surpassing state-of-the-art models.
Computer Vision and Pattern Recognition 37
☆ Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure
Photoplethysmography (PPG) sensor in wearable and clinical devices provides valuable physiological insights in a non-invasive and real-time fashion. Specialized Foundation Models (FM) or repurposed time-series FMs are used to benchmark physiological tasks. Our experiments with fine-tuning FMs reveal that Vision FM (VFM) can also be utilized for this purpose and, in fact, surprisingly leads to state-of-the-art (SOTA) performance on many tasks, notably blood pressure estimation. We leverage VFMs by simply transforming one-dimensional PPG signals into image-like two-dimensional representations, such as the Short-Time Fourier transform (STFT). Using the latest VFMs, such as DINOv3 and SIGLIP-2, we achieve promising performance on other vital signs and blood lab measurement tasks as well. Our proposal, Vision4PPG, unlocks a new class of FMs to achieve SOTA performance with notable generalization to other 2D input representations, including STFT phase and recurrence plots. Our work improves upon prior investigations of vision models for PPG by conducting a comprehensive study, comparing them to state-of-the-art time-series FMs, and demonstrating the general PPG processing ability by reporting results on six additional tasks. Thus, we provide clinician-scientists with a new set of powerful tools that is also computationally efficient, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques.
comment: BHI abstract extended
☆ PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion NeurIPS 2025
Point cloud completion is essential for robust 3D perception in safety-critical applications such as robotics and augmented reality. However, existing models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time. To address this limitation, we propose PointMAC, a meta-learned framework for robust test-time adaptation in point cloud completion. It enables sample-specific refinement without requiring additional supervision. Our method optimizes the completion model under two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness. A meta-auxiliary learning strategy based on Model-Agnostic Meta-Learning (MAML) ensures that adaptation driven by auxiliary objectives is consistently aligned with the primary completion task. During inference, we adapt the shared encoder on-the-fly by optimizing auxiliary losses, with the decoder kept fixed. To further stabilize adaptation, we introduce Adaptive $\lambda$-Calibration, a meta-learned mechanism for balancing gradients between primary and auxiliary objectives. Extensive experiments on synthetic, simulated, and real-world datasets demonstrate that PointMAC achieves state-of-the-art results by refining each sample individually to produce high-quality completions. To the best of our knowledge, this is the first work to apply meta-auxiliary test-time adaptation to point cloud completion.
comment: NeurIPS 2025
☆ Ortho-Fuse: Orthomosaic Generation for Sparse High-Resolution Crop Health Maps Through Intermediate Optical Flow Estimation
AI-driven crop health mapping systems offer substantial advantages over conventional monitoring approaches through accelerated data acquisition and cost reduction. However, widespread farmer adoption remains constrained by technical limitations in orthomosaic generation from sparse aerial imagery datasets. Traditional photogrammetric reconstruction requires 70-80\% inter-image overlap to establish sufficient feature correspondences for accurate geometric registration. AI-driven systems operating under resource-constrained conditions cannot consistently achieve these overlap thresholds, resulting in degraded reconstruction quality that undermines user confidence in autonomous monitoring technologies. In this paper, we present Ortho-Fuse, an optical flow-based framework that enables the generation of a reliable orthomosaic with reduced overlap requirements. Our approach employs intermediate flow estimation to synthesize transitional imagery between consecutive aerial frames, artificially augmenting feature correspondences for improved geometric reconstruction. Experimental validation demonstrates a 20\% reduction in minimum overlap requirements. We further analyze adoption barriers in precision agriculture to identify pathways for enhanced integration of AI-driven monitoring systems.
comment: 6 Figures, 9 pages
☆ Ordinal Scale Traffic Congestion Classification with Multi-Modal Vision-Language and Motion Analysis
Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management. This paper presents a multimodal framework combining open-vocabulary visual-language reasoning (CLIP), object detection (YOLO-World), and motion analysis via MOG2-based background subtraction. The system predicts congestion levels on an ordinal scale from 1 (free flow) to 5 (severe congestion), enabling semantically aligned and temporally consistent classification. To enhance interpretability, we incorporate motion-based confidence weighting and generate annotated visual outputs. Experimental results show the model achieves 76.7 percent accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa (QWK) of 0.684, significantly outperforming unimodal baselines. These results demonstrate the framework's effectiveness in preserving ordinal structure and leveraging visual-language and motion modalities. Future enhancements include incorporating vehicle sizing and refined density metrics.
comment: 7 pages, 4 figures. Preprint submitted to arXiv in October 2025
☆ From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries NeurIPS 2025
Real-world scenes, such as those in ScanNet, are difficult to capture, with highly limited data available. Generating realistic scenes with varied object poses remains an open and challenging task. In this work, we propose FactoredScenes, a framework that synthesizes realistic 3D scenes by leveraging the underlying structure of rooms while learning the variation of object poses from lived-in scenes. We introduce a factored representation that decomposes scenes into hierarchically organized concepts of room programs and object poses. To encode structure, FactoredScenes learns a library of functions capturing reusable layout patterns from which scenes are drawn, then uses large language models to generate high-level programs, regularized by the learned library. To represent scene variations, FactoredScenes learns a program-conditioned model to hierarchically predict object poses, and retrieves and places 3D objects in a scene. We show that FactoredScenes generates realistic, real-world rooms that are difficult to distinguish from real ScanNet scenes.
comment: NeurIPS 2025
☆ SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation ICML
We propose SAM2LoRA, a parameter-efficient fine-tuning strategy that adapts the Segment Anything Model 2 (SAM2) for fundus image segmentation. SAM2 employs a masked autoencoder-pretrained Hierarchical Vision Transformer for multi-scale feature decoding, enabling rapid inference in low-resource settings; however, fine-tuning remains challenging. To address this, SAM2LoRA integrates a low-rank adapter into both the image encoder and mask decoder, requiring fewer than 5\% of the original trainable parameters. Our analysis indicates that for cross-dataset fundus segmentation tasks, a composite loss function combining segmentationBCE, SoftDice, and FocalTversky losses is essential for optimal network tuning. Evaluated on 11 challenging fundus segmentation datasets, SAM2LoRA demonstrates high performance in both blood vessel and optic disc segmentation under cross-dataset training conditions. It achieves Dice scores of up to 0.86 and 0.93 for blood vessel and optic disc segmentation, respectively, and AUC values of up to 0.98 and 0.99, achieving state-of-the-art performance while substantially reducing training overhead.
comment: Accepted for publication at the 2025 International Conference on Machine Learning and Applications (ICMLA)
☆ Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking
Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features, limiting their ability to leverage both fine-grained object details and spatially structured scene representations. In this work, we propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features to leverage their complementary strengths. Our approach introduces BEV maps guided by foundation models, leveraging descriptive DINOv2 features that are distilled into BEV representations through a novel distillation process. By integrating PV features with BEV maps enriched with semantic and geometric features from DINOv2, our model leverages this hybrid representation via deformable aggregation to enhance 3D object detection and tracking. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that DualViewDistill achieves state-of-the-art performance. The results showcase the potential of foundation model BEV maps to enable more reliable perception for autonomous driving. We make the code and pre-trained models available at https://dualviewdistill.cs.uni-freiburg.de .
☆ ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlike prior methods that rely on iterative, brute-force attacks, ArtPerception introduces a systematic, two-phase methodology. Phase 1 conducts a one-time, model-specific pre-test to empirically determine the optimal parameters for ASCII art recognition. Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack. We propose a Modified Levenshtein Distance (MLD) metric for a more nuanced evaluation of an LLM's recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework's real-world relevance by showing its successful transferability to leading commercial models, including GPT-4o, Claude Sonnet 3.7, and DeepSeek-V3, and by conducting a rigorous effectiveness analysis against potential defenses such as LLaMA Guard and Azure's content filters. Our findings underscore that true LLM security requires defending against a multi-modal space of interpretations, even within text-only inputs, and highlight the effectiveness of strategic, reconnaissance-based attacks. Content Warning: This paper includes potentially harmful and offensive model outputs.
comment: 30 pages, 22 figures. This preprint has been accepted for publication in Elsevier JOURNAL OF NETWORK AND COMPUTER APPLICATIONS (JNCA)
☆ X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which in unity empower VLA models with effective exploitation of varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation-X-VLA-0.9B simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide axes of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
comment: preprint, technical report, 33 pages
☆ VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
Existing for audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose VividAnimator, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that Vivid Animator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.
comment: Comments: 10 pages, 6 figures
☆ Opacity-Gradient Driven Density Control for Compact and Efficient Few-Shot 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) struggles in few-shot scenarios, where its standard adaptive density control (ADC) can lead to overfitting and bloated reconstructions. While state-of-the-art methods like FSGS improve quality, they often do so by significantly increasing the primitive count. This paper presents a framework that revises the core 3DGS optimization to prioritize efficiency. We replace the standard positional gradient heuristic with a novel densification trigger that uses the opacity gradient as a lightweight proxy for rendering error. We find this aggressive densification is only effective when paired with a more conservative pruning schedule, which prevents destructive optimization cycles. Combined with a standard depth-correlation loss for geometric guidance, our framework demonstrates a fundamental improvement in efficiency. On the 3-view LLFF dataset, our model is over 40% more compact (32k vs. 57k primitives) than FSGS, and on the Mip-NeRF 360 dataset, it achieves a reduction of approximately 70%. This dramatic gain in compactness is achieved with a modest trade-off in reconstruction metrics, establishing a new state-of-the-art on the quality-vs-efficiency Pareto frontier for few-shot view synthesis.
☆ Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners laying the groundwork for future medical foundation models built on video models.
☆ MRI Brain Tumor Detection with Computer Vision
This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U-Net for semantic segmentation and EfficientDet for anchor-based object detection to enhance the localization and identification of tumors. Our results demonstrate promising improvements in the accuracy and efficiency of brain tumor diagnostics, underscoring the potential of deep learning in medical imaging and its significance in improving clinical outcomes.
comment: 12 pages, 8 figures, final project report for CS4100 (Machine Learning), Northeastern University, April 2024
♻ ☆ MultiCOIN: Multi-Modal COntrollable Video INbetweening
Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
comment: Project website: https://multicoinx.github.io/multicoin/
♻ ☆ MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
♻ ☆ The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
♻ ☆ Motion Capture from Inertial and Vision Sensors
Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) large scale of over five million frames and 400 minutes duration; 2) multi-modality data of IMUs signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS dataset, we propose a SparseNet framework to capture human motion from IMUs and videos by discovering their supplementary features and exploring the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experiment results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.
comment: 10 pages,8 figures
♻ ☆ CrunchLLM: Multitask LLMs for Structured Business Reasoning and Outcome Prediction
Predicting the success of start-up companies, defined as achieving an exit through acquisition or IPO, is a critical problem in entrepreneurship and innovation research. Datasets such as Crunchbase provide both structured information (e.g., funding rounds, industries, investor networks) and unstructured text (e.g., company descriptions), but effectively leveraging this heterogeneous data for prediction remains challenging. Traditional machine learning approaches often rely only on structured features and achieve moderate accuracy, while large language models (LLMs) offer rich reasoning abilities but struggle to adapt directly to domain-specific business data. We present \textbf{CrunchLLM}, a domain-adapted LLM framework for startup success prediction. CrunchLLM integrates structured company attributes with unstructured textual narratives and applies parameter-efficient fine-tuning strategies alongside prompt optimization to specialize foundation models for entrepreneurship data. Our approach achieves accuracy exceeding 80\% on Crunchbase startup success prediction, significantly outperforming traditional classifiers and baseline LLMs. Beyond predictive performance, CrunchLLM provides interpretable reasoning traces that justify its predictions, enhancing transparency and trustworthiness for financial and policy decision makers. This work demonstrates how adapting LLMs with domain-aware fine-tuning and structured--unstructured data fusion can advance predictive modeling of entrepreneurial outcomes. CrunchLLM contributes a methodological framework and a practical tool for data-driven decision making in venture capital and innovation policy.
♻ ☆ FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition
Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.
♻ ☆ Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding
Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that references vision tokens to guide text generation. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization, and using its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT consistently enhances visual grounding with minimal computational overhead, and achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.
comment: Code available at https://github.com/bscho333/ReVisiT
♻ ☆ MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
♻ ☆ Investigating VLM Hallucination from a Cognitive Psychology Perspective: A First Step Toward Interpretation with Intriguing Observations
Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors, and may have neglected the possibility that hallucination behaviours might mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs' cognitive biases that lead to hallucinations, including sycophancy, logical inconsistency, and a newly identified VLMs behaviour: appeal to authority. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation.
♻ ☆ Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for RAG-Based Medical Diagnosis
Retrieving relevant visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. We develop a multimodal retrieval model jointly optimized with an LVLM for medical diagnosis, unlike standard RAG which doesn't backpropagate LVLM errors to the retriever. Using only general-purpose backbones with lightweight fine-tuning, our model achieves competitive results with medically-pretrained models on clinical classification and VQA tasks. In a novel analysis, we find that different top-retrieved images often yield different predictions for the same target, and that these cases are challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap -- leaving ample room for improvement by future methods. Code available at https://github.com/Nirmaz/JOMED.
♻ ☆ Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer Network for Trajectory Prediction
Predicting crowd intentions and trajectories is critical for a range of real-world applications, involving social robotics and autonomous driving. Accurately modeling such behavior remains challenging due to the complexity of pairwise spatial-temporal interactions and the heterogeneous influence of groupwise dynamics. To address these challenges, we propose Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. Hyper-STTN constructs multiscale hypergraphs of varying group sizes to model groupwise correlations, captured through spectral hypergraph convolution based on random-walk probabilities. In parallel, a spatial-temporal transformer is employed to learn pedestrians' pairwise latent interactions across spatial and temporal dimensions. These heterogeneous groupwise and pairwise features are subsequently fused and aligned via a multimodal transformer. Extensive experiments on public pedestrian motion datasets demonstrate that Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models.
♻ ☆ Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction ICCV 2025
Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any directions, beyond object-centric inputs. Plus, to improve the capability and generality of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS yields improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes than the state-of-the-art methods, without depth estimator. Plus, our method enjoys over 5$\times$ faster speed ($\sim$6s on an A100 GPU). Our Project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive results. The code and models are publicly available at https://github.com/caiyuanhao1998/Open-DiffusionGS
comment: ICCV 2025; A novel one-stage 3DGS-based diffusion for 3D object generation and scene reconstruction from a single view in ~6 seconds
♻ ☆ OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions NeurIPS 2025
Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem that how to use the signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus
comment: NeurIPS 2025; A data construction pipeline and a diffusion Transformer framework for controllable subject-driven video customization
♻ ☆ CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models
As Text-to-Image models continue to evolve, so does the risk of generating unsafe, copyrighted, or privacy-violating content. Existing safety interventions - ranging from training data curation and model fine-tuning to inference-time filtering and guidance - often suffer from incomplete concept removal, susceptibility to jail-breaking, computational inefficiency, or collateral damage to unrelated capabilities. In this paper, we introduce CURE, a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models, enabling fast, interpretable, and highly specific suppression of undesired concepts. At the core of our method is the Spectral Eraser, a closed-form, orthogonal projection module that identifies discriminative subspaces using Singular Value Decomposition over token embeddings associated with the concepts to forget and retain. Intuitively, the Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. This operator is then applied in a single step update to yield an edited model in which the target concept is effectively unlearned - without retraining, supervision, or iterative optimization. To balance the trade-off between filtering toxicity and preserving unrelated concepts, we further introduce an Expansion Mechanism for spectral regularization which selectively modulates singular vectors based on their relative significance to control the strength of forgetting. All the processes above are in closed-form, guaranteeing extremely efficient erasure in only $2$ seconds. Benchmarking against prior approaches, CURE achieves a more efficient and thorough removal for targeted artistic styles, objects, identities, or explicit content, with minor damage to original generation ability and demonstrates enhanced robustness against red-teaming.
♻ ☆ OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates
Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates. termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models are available at https://github.com/jp-guo/OSCAR.
♻ ☆ A Vision-Based Shared-Control Teleoperation Scheme for Controlling the Robotic Arm of a Four-Legged Robot
In hazardous and remote environments, robotic systems perform critical tasks demanding improved safety and efficiency. Among these, quadruped robots with manipulator arms offer mobility and versatility for complex operations. However, teleoperating quadruped robots is challenging due to the lack of integrated obstacle detection and intuitive control methods for the robotic arm, increasing collision risks in confined or dynamically changing workspaces. Teleoperation via joysticks or pads can be non-intuitive and demands a high level of expertise due to its complexity, culminating in a high cognitive load on the operator. To address this challenge, a teleoperation approach that directly maps human arm movements to the robotic manipulator offers a simpler and more accessible solution. This work proposes an intuitive remote control by leveraging a vision-based pose estimation pipeline that utilizes an external camera with a machine learning-based model to detect the operator's wrist position. The system maps these wrist movements into robotic arm commands to control the robot's arm in real-time. A trajectory planner ensures safe teleoperation by detecting and preventing collisions with both obstacles and the robotic arm itself. The system was validated on the real robot, demonstrating robust performance in real-time control. This teleoperation approach provides a cost-effective solution for industrial applications where safety, precision, and ease of use are paramount, ensuring reliable and intuitive robotic control in high-risk environments.
♻ ☆ Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization
Recent advances in conditional generative modeling have introduced Continuous conditional Generative Adversarial Network (CcGAN) and Continuous Conditional Diffusion Model (CCDM) for estimating high-dimensional data distributions conditioned on scalar, continuous regression labels (e.g., angles, ages, or temperatures). However, these approaches face fundamental limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. To address these issues, we propose CcGAN-AVAR, an enhanced CcGAN framework featuring (1) two novel components for handling data imbalance - an adaptive vicinity mechanism that dynamically adjusts vicinity size and a multi-task discriminator that enhances generator training through auxiliary regression and density ratio estimation - and (2) the GAN framework's native one-step generator, enable 30x-2000x faster inference than CCDM. Extensive experiments on four benchmark datasets (64x64 to 256x256 resolution) across eleven challenging settings demonstrate that CcGAN-AVAR achieves state-of-the-art generation quality while maintaining sampling efficiency.
♻ ☆ Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation
This paper presents a deep learning framework designed to enhance the grasping capabilities of quadrupeds equipped with arms, with a focus on improving precision and adaptability. Our approach centers on a sim-to-real methodology that minimizes reliance on physical data collection. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various perspectives, we created pixel-wise annotated grasp-quality maps to serve as the ground truth for our model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from an onboard RGB and depth cameras, including RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp-quality heatmap to identify the optimal grasp point. We validated the complete framework on a four-legged robot. The system successfully executed a full loco-manipulation task: autonomously navigating to a target object, perceiving it with its sensors, predicting the optimal grasp pose using our model, and performing a precise grasp. This work proves that leveraging simulated training with advanced sensing offers a scalable and effective solution for object handling.
♻ ☆ A Synthetic Dataset for Manometry Recognition in Robotic Applications
This paper addresses the challenges of data scarcity and high acquisition costs in training robust object detection models for complex industrial environments, such as offshore oil platforms. Data collection in these hazardous settings often limits the development of autonomous inspection systems. To mitigate this issue, we propose a hybrid data synthesis pipeline that integrates procedural rendering and AI-driven video generation. The approach uses BlenderProc to produce photorealistic images with domain randomization and NVIDIA's Cosmos-Predict2 to generate physically consistent video sequences with temporal variation. A YOLO-based detector trained on a composite dataset, combining real and synthetic data, outperformed models trained solely on real images. A 1:1 ratio between real and synthetic samples achieved the highest accuracy. The results demonstrate that synthetic data generation is a viable, cost-effective, and safe strategy for developing reliable perception systems in safety-critical and resource-constrained industrial applications.
♻ ☆ Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning
Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
comment: Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025
♻ ☆ Improving Hierarchical Representations of Vectorized HD Maps with Perspective Clues
The construction of vectorized High-Definition (HD) maps from onboard surround-view cameras has become a significant focus in autonomous driving. However, current map vector estimation pipelines face two key limitations: input-agnostic queries struggle to capture complex map structures, and the view transformation leads to information loss. These issues often result in inaccurate shape restoration or missing instances in map predictions. To address this concern, we propose a novel approach, namely \textbf{PerCMap}, which explicitly exploits clues from perspective-view features at both instance and point level. Specifically, at instance level, we propose Cross-view Instance Activation (CIA) to activate instance queries across surround-view images, thereby helping the model recover the instance attributes of map vectors. At point level, we design Dual-view Point Embedding (DPE), which fuses features from both views to generate input-aware positional embeddings and improve the accuracy of point coordinate estimation. Extensive experiments on \textit{nuScenes} and \textit{Argoverse 2} demonstrate that PerCMap achieves strong and consistent performance across benchmarks, reaching 67.1 and 70.5 mAP, respectively.
♻ ☆ {S\textsuperscript{2}M\textsuperscript{2}}: Scalable Stereo Matching Model for Reliable Depth Estimation ICCV
The pursuit of a generalizable stereo matching model, capable of performing well across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. However, global matching architectures, while theoretically more robust, have historically been rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with {S\textsuperscript{2}M\textsuperscript{2}}: a global matching architecture that achieves state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. {S\textsuperscript{2}M\textsuperscript{2}} establishes a new state of the art on Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods in most metrics while reconstructing high-quality details with competitive efficiency.
comment: 8 pages, 5 figures, ICCV accepted paper
♻ ☆ Large-Area Fabrication-Aware Computational Diffractive Optics SIGGRAPH
Differentiable optics, as an emerging paradigm that jointly optimizes optics and (optional) image processing algorithms, has made innovative optical designs possible across a broad range of applications. Many of these systems utilize diffractive optical components (DOEs) for holography, PSF engineering, or wavefront shaping. Existing approaches have, however, mostly remained limited to laboratory prototypes, owing to a large quality gap between simulation and manufactured devices. We aim at lifting the fundamental technical barriers to the practical use of learned diffractive optical systems. To this end, we propose a fabrication-aware design pipeline for diffractive optics fabricated by direct-write grayscale lithography followed by nano-imprinting replication, which is directly suited for inexpensive mass production of large area designs. We propose a super-resolved neural lithography model that can accurately predict the 3D geometry generated by the fabrication process. This model can be seamlessly integrated into existing differentiable optics frameworks, enabling fabrication-aware, end-to-end optimization of computational optical systems. To tackle the computational challenges, we also devise tensor-parallel compute framework centered on distributing large-scale FFT computation across many GPUs. As such, we demonstrate large scale diffractive optics designs up to 32.16 mm $\times$ 21.44 mm, simulated on grids of up to 128,640 by 85,760 feature points. We find adequate agreement between simulation and fabricated prototypes for applications such as holography and PSF engineering. We also achieve high image quality from an imaging system comprised only of a single DOE, with images processed only by a Wiener filter utilizing the simulation PSF. We believe our findings lift the fabrication limitations for real-world applications of diffractive optics and differentiable optical design.
comment: To be appeared in SIGGRAPH Asia and ACM Trans. on Graphics 2025. Code is available at https://github.com/Vandermode/LAFA
♻ ☆ Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control
Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
Computer Vision and Pattern Recognition 158
☆ StreamingVLM: Real-Time Understanding for Infinite Video Streams
Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
comment: The first two authors contributed equally to this work
☆ VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement), which demonstrate that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
comment: Homepage: https://ltbai.github.io/VITA-VLA/
☆ SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as the first attempt to broaden the all-scale spatial intelligence of MLLMs to the best of our knowledge. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released on https://peiwensun2000.github.io/mm2km .
comment: Project Page: https://peiwensun2000.github.io/mm2km/
☆ STaTS: Structure-Aware Temporal Sequence Summarization via Statistical Window Merging
Time series data often contain latent temporal structure, transitions between locally stationary regimes, repeated motifs, and bursts of variability, that are rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1 and ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90\% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
comment: 10 pages, 5 figures, 4 tables. Under Review
☆ Vision Language Models: A Survey of 26K Papers
We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
comment: VLM/LLM Learning Notes
☆ FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection
Object detection in biomedical settings is fundamentally constrained by the scarcity of labeled data and the frequent emergence of novel or rare categories. We present FSP-DETR, a unified detection framework that enables robust few-shot detection, open-set recognition, and generalization to unseen biomedical tasks within a single model. Built upon a class-agnostic DETR backbone, our approach constructs class prototypes from original support images and learns an embedding space using augmented views and a lightweight transformer decoder. Training jointly optimizes a prototype matching loss, an alignment-based separation loss, and a KL divergence regularization to improve discriminative feature learning and calibration under scarce supervision. Unlike prior work that tackles these tasks in isolation, FSP-DETR enables inference-time flexibility to support unseen class recognition, background rejection, and cross-task adaptation without retraining. We also introduce a new ova species detection benchmark with 20 parasite classes and establish standardized evaluation protocols. Extensive experiments across ova, blood cell, and malaria detection tasks demonstrate that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors, especially in low-shot and open-set scenarios.
comment: 10 pages, 3 Figures, 5 Tables. Under Review
☆ Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need ''vicarious trial and error'' - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in the ever more challenging environments.
☆ TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control NeurIPS 2025
Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model's ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by conditioning the model's weights directly. Our framework uses a hypernetwork to generate LoRA adapters on-the-fly, tailoring weight modifications for the frozen backbone at each diffusion step based on time and the user's condition. This mechanism enables the model to learn and execute an explicit, adaptive strategy for applying conditional guidance throughout the entire generation process. Through experiments on various data domains, we demonstrate that this dynamic, parametric control significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods. TC-LoRA establishes an alternative approach in which the model's conditioning strategy is modified through a deeper functional adaptation of its weights, allowing control to align with the dynamic demands of the task and generative stage.
comment: 10 pages; NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE)
☆ FLOWING: Implicit Neural Flows for Structure-Preserving Morphing
Morphing is a long-standing problem in vision and computer graphics, requiring a time-dependent warping for feature alignment and a blending for smooth interpolation. Recently, multilayer perceptrons (MLPs) have been explored as implicit neural representations (INRs) for modeling such deformations, due to their meshlessness and differentiability; however, extracting coherent and accurate morphings from standard MLPs typically relies on costly regularizations, which often lead to unstable training and prevent effective feature alignment. To overcome these limitations, we propose FLOWING (FLOW morphING), a framework that recasts warping as the construction of a differential vector flow, naturally ensuring continuity, invertibility, and temporal coherence by encoding structural flow properties directly into the network architectures. This flow-centric approach yields principled and stable transformations, enabling accurate and structure-preserving morphing of both 2D images and 3D shapes. Extensive experiments across a range of applications - including face and image morphing, as well as Gaussian Splatting morphing - show that FLOWING achieves state-of-the-art morphing quality with faster convergence. Code and pretrained models are available at http://schardong.github.io/flowing.
comment: 10 pages main paper; 9 pages references and appendix
☆ PRNet: Original Information Is All You Have
Small object detection in aerial images suffers from severe information degradation during feature extraction due to limited pixel representations, where shallow spatial details fail to align effectively with semantic information, leading to frequent misses and false positives. Existing FPN-based methods attempt to mitigate these losses through post-processing enhancements, but the reconstructed details often deviate from the original image information, impeding their fusion with semantic content. To address this limitation, we propose PRNet, a real-time detection framework that prioritizes the preservation and efficient utilization of primitive shallow spatial features to enhance small object representations. PRNet achieves this via two modules:the Progressive Refinement Neck (PRN) for spatial-semantic alignment through backbone reuse and iterative refinement, and the Enhanced SliceSamp (ESSamp) for preserving shallow information during downsampling via optimized rearrangement and convolution. Extensive experiments on the VisDrone, AI-TOD, and UAVDT datasets demonstrate that PRNet outperforms state-of-the-art methods under comparable computational constraints, achieving superior accuracy-efficiency trade-offs.
☆ Diagonal Artifacts in Samsung Images: PRNU Challenges and Solutions
We investigate diagonal artifacts present in images captured by several Samsung smartphones and their impact on PRNU-based camera source verification. We first show that certain Galaxy S series models share a common pattern causing fingerprint collisions, with a similar issue also found in some Galaxy A models. Next, we demonstrate that reliable PRNU verification remains feasible for devices supporting PRO mode with raw capture, since raw images bypass the processing pipeline that introduces artifacts. This option, however, is not available for the mid-range A series models or in forensic cases without access to raw images. Finally, we outline potential forensic applications of the diagonal artifacts, such as reducing misdetections in HDR images and localizing regions affected by synthetic bokeh in portrait-mode images.
☆ PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool's primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool's operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs-spanning proprietary, open-source, specialized embodied, and backbones in VLAs-reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.
☆ A methodology for clinically driven interactive segmentation evaluation
Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
comment: 10 pages, Medical Image Computing and Computed Assisted Intervention 2025
☆ Few-shot multi-token DreamBooth with LoRa for style-consistent character generation
The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.
☆ D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
☆ SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests
Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.
comment: 8 pages, 5 figures
☆ Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement
Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.
☆ Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians
Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.
comment: 19 pages, 9 figures
☆ Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation
Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker's intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views on the contextual (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity, therefore, improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.
☆ Utilizing dynamic sparsity on pretrained DETR
Efficient inference with transformer-based models remains a challenge, especially in vision tasks like object detection. We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns. While simple, SIBS offers limited gains due to the input-dependent nature of sparsity. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR. MGS predicts dynamic sparsity using a small linear layer and achieves up to 85 to 95% activation sparsity. Experiments on the COCO dataset show that MGS maintains or even improves performance while significantly reducing computation. Our method offers a practical, input-adaptive approach to sparsification, enabling efficient deployment of pretrained vision transformers without full model retraining.
comment: 6 pages 4 figures and 4 tables , accepted for 2025 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, AUG. 31 to SEP. 3, 2025, ISTANBUL, TURKEY
☆ Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification
Accurate forest biomass quantification is vital for carbon cycle monitoring. While airborne LiDAR excels at capturing 3D forest structure, directly estimating woody volume and Aboveground Biomass (AGB) from point clouds is challenging due to difficulties in modeling long-range dependencies needed to distinguish trees.We propose Minkowski-MambaNet, a novel deep learning framework that directly estimates volume and AGB from raw LiDAR. Its key innovation is integrating the Mamba model's Selective State Space Model (SSM) into a Minkowski network, enabling effective encoding of global context and long-range dependencies for improved tree differentiation. Skip connections are incorporated to enhance features and accelerate convergence.Evaluated on Danish National Forest Inventory LiDAR data, Minkowski-MambaNet significantly outperforms state-of-the-art methods, providing more accurate and robust estimates. Crucially, it requires no Digital Terrain Model (DTM) and is robust to boundary artifacts. This work offers a powerful tool for large-scale forest biomass analysis, advancing LiDAR-based forest inventories.
☆ A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
☆ Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via patch matching-based multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Source code will be released upon publication.
☆ BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception NeurIPS 2025
Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception ("see") and requires fine-grained observation and analytical reasoning ("observe"). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space-such as chain-of-thought or self-criticism can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice
comment: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks
☆ Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been proven to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap in training tests. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. Firstly, we use two widely-used strategies, e.g., zero-shot and supervised fine-tuning (SFT) to assess the lower bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to finetune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.
comment: EMNLP2025. Code is avaible at https://github.com/bytedance/DynamicCoT
☆ Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark NeurIPS 2025
We engage in the relatively underexplored task named thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In sight of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model's features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a Selective Progressive Training (SPT) mechanism is introduced to gradually refine the model's handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details, but also enhancing the overall contrast of the thermal image. Furthermore, we introduce the most high-quality, multi-scenarios infrared benchmark covering a wide range of scenarios. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76\% improvement. Code is available at https://github.com/Zihang-Chen/HM-TIR.
comment: This paper has been accepted by NeurIPS 2025
Efficient Bayesian Inference from Noisy Pairwise Comparisons
Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
☆ Instance-Aware Robust Consistency Regularization for Semi-Supervised Nuclei Instance Segmentation
Nuclei instance segmentation in pathological images is crucial for downstream tasks such as tumor microenvironment analysis. However, the high cost and scarcity of annotated data limit the applicability of fully supervised methods, while existing semi-supervised methods fail to adequately regularize consistency at the instance level, lack leverage of the inherent prior knowledge of pathological structures, and are prone to introducing noisy pseudo-labels during training. In this paper, we propose an Instance-Aware Robust Consistency Regularization Network (IRCR-Net) for accurate instance-level nuclei segmentation. Specifically, we introduce the Matching-Driven Instance-Aware Consistency (MIAC) and Prior-Driven Instance-Aware Consistency (PIAC) mechanisms to refine the nuclei instance segmentation result of the teacher and student subnetwork, particularly for densely distributed and overlapping nuclei. We incorporate morphological prior knowledge of nuclei in pathological images and utilize these priors to assess the quality of pseudo-labels generated from unlabeled data. Low-quality pseudo-labels are discarded, while high-quality predictions are enhanced to reduce pseudo-label noise and benefit the network's robust training. Experimental results demonstrate that the proposed method significantly enhances semi-supervised nuclei instance segmentation performance across multiple public datasets compared to existing approaches, even surpassing fully supervised methods in some scenarios.
☆ Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation ICCV 2025
Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) Firstly, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which also indeed benefits downstream tasks like BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.
comment: ICCV 2025
☆ RadioFlow: Efficient Radio Map Construction Framework with Flow Matching
Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose \textbf{RadioFlow}, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with \textbf{up to 8$\times$ fewer parameters} and \textbf{over 4$\times$ faster inference} compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at \href{https://github.com/Hxxxz0/RadioFlow}{GitHub}.
☆ Rewiring Development in Brain Segmentation: Leveraging Adult Brain Priors for Enhancing Infant MRI Segmentation
Accurate segmentation of infant brain MRI is critical for studying early neurodevelopment and diagnosing neurological disorders. Yet, it remains a fundamental challenge due to continuously evolving anatomy of the subjects, motion artifacts, and the scarcity of high-quality labeled data. In this work, we present LODi, a novel framework that utilizes prior knowledge from an adult brain MRI segmentation model to enhance the segmentation performance of infant scans. Given the abundance of publicly available adult brain MRI data, we pre-train a segmentation model on a large adult dataset as a starting point. Through transfer learning and domain adaptation strategies, we progressively adapt the model to the 0-2 year-old population, enabling it to account for the anatomical and imaging variability typical of infant scans. The adaptation of the adult model is carried out using weakly supervised learning on infant brain scans, leveraging silver-standard ground truth labels obtained with FreeSurfer. By introducing a novel training strategy that integrates hierarchical feature refinement and multi-level consistency constraints, our method enables fast, accurate, age-adaptive segmentation, while mitigating scanner and site-specific biases. Extensive experiments on both internal and external datasets demonstrate the superiority of our approach over traditional supervised learning and domain-specific models. Our findings highlight the advantage of leveraging adult brain priors as a foundation for age-flexible neuroimaging analysis, paving the way for more reliable and generalizable brain MRI segmentation across the lifespan.
☆ CapGeo: A Caption-Assisted Approach to Geometric Reasoning
Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
comment: preprint, under review
☆ Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling
Animals often forage via Levy walks stochastic trajectories with heavy tailed step lengths optimized for sparse resource environments. We show that human visual gaze follows similar dynamics when scanning images. While traditional models emphasize image based saliency, the underlying spatiotemporal statistics of eye movements remain underexplored. Understanding these dynamics has broad applications in attention modeling and vision-based interfaces. In this study, we conducted a large scale human subject experiment involving 40 participants viewing 50 diverse images under unconstrained conditions, recording over 4 million gaze points using a high speed eye tracker. Analysis of these data shows that the gaze trajectory of the human eye also follows a Levy walk akin to animal foraging. This suggests that the human eye forages for visual information in an optimally efficient manner. Further, we trained a convolutional neural network (CNN) to predict fixation heatmaps from image input alone. The model accurately reproduced salient fixation regions across novel images, demonstrating that key components of gaze behavior are learnable from visual structure alone. Our findings present new evidence that human visual exploration obeys statistical laws analogous to natural foraging and open avenues for modeling gaze through generative and predictive frameworks.
☆ Spotlight on Token Perception for Multimodal Reinforcement Learning
While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
comment: 31 pages, 10 figures, project page: https://github.com/huaixuheqing/VPPO-RL
☆ MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated \texttt{[FIND]} token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as start point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
☆ Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects
Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in the presence of physical triggers. Specifically, based on a popular VLA benchmark LIBERO, we introduce BadLIBERO that incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA's actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal in 97 percentage of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at https://goba-attack.github.io/.
☆ Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy
To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
comment: Code is available: https://github.com/TruhnLab/VisionSemanticEntropy
☆ Zero-shot image privacy classification with Vision-Language Models
While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
comment: 5 pages, 3 figures, 3 tables. This work has been submitted to the ICASSP 2026
☆ Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras
Shoulder disorders, such as frozen shoulder (a.k.a., adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework divides the two tasks of action understanding and disease diagnosis, which are respectively completed by two MLLMs. In addition to traditional evaluation indicators, this work proposes a novel metric called Usability Index by the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6\% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.
☆ Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation
Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration
comment: This work has been submitted to IEEE for possible publication
☆ Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation
Cross-Domain Sequential Recommendation (CDSR) plays a crucial role in modern consumer electronics and e-commerce platforms, where users interact with diverse services such as books, movies, and online retail products. These systems must accurately capture both domain-specific and cross-domain behavioral patterns to provide personalized and seamless consumer experiences. To address this challenge, we propose \textbf{TEMA-LLM} (\textit{Tag-Enriched Multi-Attention with Large Language Models}), a practical and effective framework that integrates \textit{Large Language Models (LLMs)} for semantic tag generation and enrichment. Specifically, TEMA-LLM employs LLMs to assign domain-aware prompts and generate descriptive tags from item titles and descriptions. The resulting tag embeddings are fused with item identifiers as well as textual and visual features to construct enhanced item representations. A \textit{Tag-Enriched Multi-Attention} mechanism is then introduced to jointly model user preferences within and across domains, enabling the system to capture complex and evolving consumer interests. Extensive experiments on four large-scale e-commerce datasets demonstrate that TEMA-LLM consistently outperforms state-of-the-art baselines, underscoring the benefits of LLM-based semantic tagging and multi-attention integration for consumer-facing recommendation systems. The proposed approach highlights the potential of LLMs to advance intelligent, user-centric services in the field of consumer electronics.
comment: Accepted in IEEE Transactions on Consumer Electronics 2025
☆ Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
comment: Project Page: https://stable-video-infinity.github.io/homepage/
☆ 3D Reconstruction from Transient Measurements with Time-Resolved Transformer
Transient measurements, captured by the timeresolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, challenges persist in their 3D reconstruction due to the low quantum efficiency of sensors and the high noise levels, particularly for long-range or complex scenes. To boost the 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Different from existing transformers designed for high-dimensional data, TRT has two elaborate attention designs tailored for the spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. Then, the spatio-temporal cross attention decoders integrate the local and global features in the token space, resulting in deep features with high representation capabilities. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field. Code and datasets are available at https://github.com/Depth2World/TRT.
☆ Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition
Cattle behaviour is a crucial indicator of an individual animal health, productivity and overall well-being. Video-based monitoring, combined with deep learning techniques, has become a mainstream approach in animal biometrics, and it can offer high accuracy in some behaviour recognition tasks. We present Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition, using semantic cues to improve the performance of video-based visual feature recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. To address the domain gap between web data used for the pre-trained model and real-world cattle surveillance footage, we introduce tailored data augmentation strategies and specialised text prompts. Cattle-CLIP is evaluated under both fully-supervised and few-shot learning scenarios, with a particular focus on data-scarce behaviour recognition - an important yet under-explored goal in livestock monitoring. To evaluate the proposed method, we release the CattleBehaviours6 dataset, which comprises six types of indoor behaviours: feeding, drinking, standing-self-grooming, standing-ruminating, lying-self-grooming and lying-ruminating. The dataset consists of 1905 clips collected from our John Oldacre Centre dairy farm research platform housing 200 Holstein-Friesian cows. Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviours, and demonstrates robust generalisation with limited data in few-shot scenarios, highlighting the potential of multimodal learning in agricultural and animal behaviour analysis.
comment: 16 pages, 10 figures, submitted to Computers and Electronics in Agriculture
☆ Towards Safer and Understandable Driver Intention Prediction
Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretability in maneuver prediction before they occur for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/
comment: 10 pages
☆ Modern Deep Learning Approaches for Cricket Shot Classification: A Comprehensive Baseline Study
Cricket shot classification from video sequences remains a challenging problem in sports video analysis, requiring effective modeling of both spatial and temporal features. This paper presents the first comprehensive baseline study comparing seven different deep learning approaches across four distinct research paradigms for cricket shot classification. We implement and systematically evaluate traditional CNN-LSTM architectures, attention-based models, vision transformers, transfer learning approaches, and modern EfficientNet-GRU combinations on a unified benchmark. A critical finding of our study is the significant performance gap between claims in academic literature and practical implementation results. While previous papers reported accuracies of 96\% (Balaji LRCN), 99.2\% (IJERCSE), and 93\% (Sensors), our standardized re-implementations achieve 46.0\%, 55.6\%, and 57.7\% respectively. Our modern SOTA approach, combining EfficientNet-B0 with a GRU-based temporal model, achieves 92.25\% accuracy, demonstrating that substantial improvements are possible with modern architectures and systematic optimization. All implementations follow modern MLOps practices with PyTorch Lightning, providing a reproducible research platform that exposes the critical importance of standardized evaluation protocols in sports video analysis research.
☆ Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption
Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch-processing which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames at training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both, code and compilation scripts, making oVDA easy to deploy on low-power hardware.
☆ TARO: Toward Semantically Rich Open-World Object Detection
Modern object detectors are largely confined to a "closed-world" assumption, limiting them to a predefined set of classes and posing risks when encountering novel objects in real-world scenarios. While open-set detection methods aim to address this by identifying such instances as 'Unknown', this is often insufficient. Rather than treating all unknowns as a single class, assigning them more descriptive subcategories can enhance decision-making in safety-critical contexts. For example, identifying an object as an 'Unknown Animal' (requiring an urgent stop) versus 'Unknown Debris' (requiring a safe lane change) is far more useful than just 'Unknown' in autonomous driving. To bridge this gap, we introduce TARO, a novel detection framework that not only identifies unknown objects but also classifies them into coarse parent categories within a semantic hierarchy. TARO employs a unique architecture with a sparsemax-based head for modeling objectness, a hierarchy-guided relabeling component that provides auxiliary supervision, and a classification module that learns hierarchical relationships. Experiments show TARO can categorize up to 29.9% of unknowns into meaningful coarse classes, significantly reduce confusion between unknown and known classes, and achieve competitive performance in both unknown recall and known mAP. Code will be made available.
comment: 17 pages, 5 figures
☆ Instance-Level Generation for Representation Learning
Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR's real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach offers a new, efficient, and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where the only input is the names of the target domains, unlocking a wide range of real-world applications.
☆ Online Topological Localization for Navigation Assistance in Bronchoscopy
Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need of patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.
☆ Training Feature Attribution for Vision Models
Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
☆ Polar Separable Transform for Efficient Orthogonal Rotation-Invariant Image Representation
Orthogonal moment-based image representations are fundamental in computer vision, but classical methods suffer from high computational complexity and numerical instability at large orders. Zernike and pseudo-Zernike moments, for instance, require coupled radial-angular processing that precludes efficient factorization, resulting in $\mathcal{O}(n^3N^2)$ to $\mathcal{O}(n^6N^2)$ complexity and $\mathcal{O}(N^4)$ condition number scaling for the $n$th-order moments on an $N\times N$ image. We introduce \textbf{PSepT} (Polar Separable Transform), a separable orthogonal transform that overcomes the non-separability barrier in polar coordinates. PSepT achieves complete kernel factorization via tensor-product construction of Discrete Cosine Transform (DCT) radial bases and Fourier harmonic angular bases, enabling independent radial and angular processing. This separable design reduces computational complexity to $\mathcal{O}(N^2 \log N)$, memory requirements to $\mathcal{O}(N^2)$, and condition number scaling to $\mathcal{O}(\sqrt{N})$, representing exponential improvements over polynomial approaches. PSepT exhibits orthogonality, completeness, energy conservation, and rotation-covariance properties. Experimental results demonstrate better numerical stability, computational efficiency, and competitive classification performance on structured datasets, while preserving exact reconstruction. The separable framework enables high-order moment analysis previously infeasible with classical methods, opening new possibilities for robust image analysis applications.
comment: 13 pages, 10 figures, 4 Tables
☆ MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation
Scarcity of annotated data, particularly for rare or atypical morphologies, present significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasests with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on columnar cells. This strategy systematically enriches data sets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models. Thereby, we pave the way for broader application of generative models in computational pathology.
☆ SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding
Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{\text{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{\text{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.
comment: Project website: https://github.com/weikaih04/SOS
☆ A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans
Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
☆ Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation ICCV 2025
Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5\%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60\% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
comment: Accepted by ICCV 2025
☆ GL-DT: Multi-UAV Detection and Tracking with Global-Local Integration
The extensive application of unmanned aerial vehicles (UAVs) in military reconnaissance, environmental monitoring, and related domains has created an urgent need for accurate and efficient multi-object tracking (MOT) technologies, which are also essential for UAV situational awareness. However, complex backgrounds, small-scale targets, and frequent occlusions and interactions continue to challenge existing methods in terms of detection accuracy and trajectory continuity. To address these issues, this paper proposes the Global-Local Detection and Tracking (GL-DT) framework. It employs a Spatio-Temporal Feature Fusion (STFF) module to jointly model motion and appearance features, combined with a global-local collaborative detection strategy, effectively enhancing small-target detection. Building upon this, the JPTrack tracking algorithm is introduced to mitigate common issues such as ID switches and trajectory fragmentation. Experimental results demonstrate that the proposed approach significantly improves the continuity and stability of MOT while maintaining real-time performance, providing strong support for the advancement of UAV detection and tracking technologies.
☆ MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling
We present MambaH-Fit, a state space modelling framework tailored for hyper-surface fitting-based point cloud normal estimation. Existing normal estimation methods often fall short in modelling fine-grained geometric structures, thereby limiting the accuracy of the predicted normals. Recently, state space models (SSMs), particularly Mamba, have demonstrated strong modelling capability by capturing long-range dependencies with linear complexity and inspired adaptations to point cloud processing. However, existing Mamba-based approaches primarily focus on understanding global shape structures, leaving the modelling of local, fine-grained geometric details largely under-explored. To address the issues above, we first introduce an Attention-driven Hierarchical Feature Fusion (AHFF) scheme to adaptively fuse multi-scale point cloud patch features, significantly enhancing geometric context learning in local point cloud neighbourhoods. Building upon this, we further propose Patch-wise State Space Model (PSSM) that models point cloud patches as implicit hyper-surfaces via state dynamics, enabling effective fine-grained geometric understanding for normal prediction. Extensive experiments on benchmark datasets show that our method outperforms existing ones in terms of accuracy, robustness, and flexibility. Ablation studies further validate the contribution of the proposed components.
comment: 11 pages, 12 figures
☆ Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array
Flexible microelectrode (FME) implantation into brain cortex is challenging due to the deformable fiber-like structure of FME probe and the interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.
comment: Accept by IROS 2025
☆ MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
comment: 4 pages, 4 figures, 2 tables
☆ OSCAR: Orthogonal Stochastic Control for Alignment-Respecting Diversity in Flow Matching
Flow-based text-to-image models follow deterministic trajectories, forcing users to repeatedly sample to discover diverse modes, which is a costly and inefficient process. We present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Our procedure requires no retraining or modification to the base sampler and is compatible with common flow-matching solvers. Theoretically, our method is shown to monotonically increase a volume surrogate while, due to its geometric constraints, approximately preserving the marginal distribution. This provides a principled explanation for why generation quality is robustly maintained. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.
☆ Lesion-Aware Post-Training of Latent Diffusion Models for Synthesizing Diffusion MRI from CT Perfusion MICCAI 2025
Image-to-Image translation models can help mitigate various challenges inherent to medical image acquisition. Latent diffusion models (LDMs) leverage efficient learning in compressed latent space and constitute the core of state-of-the-art generative image models. However, this efficiency comes with a trade-off, potentially compromising crucial pixel-level detail essential for high-fidelity medical images. This limitation becomes particularly critical when generating clinically significant structures, such as lesions, which often occupy only a small portion of the image. Failure to accurately reconstruct these regions can severely impact diagnostic reliability and clinical decision-making. To overcome this limitation, we propose a novel post-training framework for LDMs in medical image-to-image translation by incorporating lesion-aware medical pixel space objectives. This approach is essential, as it not only enhances overall image quality but also improves the precision of lesion delineation. We evaluate our framework on brain CT-to-MRI translation in acute ischemic stroke patients, where early and accurate diagnosis is critical for optimal treatment selection and improved patient outcomes. While diffusion MRI is the gold standard for stroke diagnosis, its clinical utility is often constrained by high costs and low accessibility. Using a dataset of 817 patients, we demonstrate that our framework improves overall image quality and enhances lesion delineation when synthesizing DWI and ADC images from CT perfusion scans, outperforming existing image-to-image translation models. Furthermore, our post-training strategy is easily adaptable to pre-trained LDMs and exhibits substantial potential for broader applications across diverse medical image translation tasks.
comment: MICCAI 2025, Lecture Notes in Computer Science Vol. 15961
☆ Auto-scaling Continuous Memory for GUI Agent
We study how to endow GUI agents with scalable memory that help generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about \$4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2\% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
☆ Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels
Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.
☆ Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy
In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85\% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
comment: Code is available at https://github.com/krennic999/ARsample
☆ On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, there are still crucial challenges in LVLMs such as object hallucination, generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE is a key factor that contributes to object hallucination. Our statistical analysis found that there are positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can synergistically work with other prior arts.
☆ Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
☆ Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation
AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
☆ HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images
Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle in the generation of accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting downstream task performance like human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, prominently enhancing generated hand realism and detection accuracy, respectively, confirming its universal effectiveness in downstream applications. Code and dataset will be available.
☆ Hierarchical Scheduling for Multi-Vector Image Retrieval
To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present an efficient scheduling framework for image retrieval - HiMIR. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to minimize unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that, HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
comment: Under Review
☆ mmJoints: Expanding Joint Representations Beyond (x,y,z) in mmWave-Based 3D Pose Estimation
In mmWave-based pose estimation, sparse signals and weak reflections often cause models to infer body joints from statistical priors rather than sensor data. While prior knowledge helps in learning meaningful representations, over-reliance on it degrades performance in downstream tasks like gesture and activity recognition. In this paper, we introduce mmJoints, a framework that augments a pre-trained, black-box mmWave-based 3D pose estimator's output with additional joint descriptors. Rather than mitigating bias, mmJoints makes it explicit by estimating the likelihood of a joint being sensed and the reliability of its predicted location. These descriptors enhance interpretability and improve downstream task accuracy. Through extensive evaluations using over 115,000 signal frames across 13 pose estimation settings, we show that mmJoints estimates descriptors with an error rate below 4.2%. mmJoints also improves joint position accuracy by up to 12.5% and boosts activity recognition by up to 16% over state-of-the-art methods.
☆ SAM2-3dMed: Empowering SAM2 for 3D Medical Image Segmentation
Accurate segmentation of 3D medical images is critical for clinical applications like disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2-3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module explicitly models bidirectional inter-slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self-supervised manner; 2) a Boundary Detection (BD) module enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas in the Medical Segmentation Decathlon (MSD) dataset) demonstrate that SAM2-3dMed significantly outperforms state-of-the-art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video-centric foundation models to spatial volumetric data.
☆ Unleashing Perception-Time Scaling to Multimodal Reasoning Models
Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.
☆ Denoised Diffusion for Object-Focused Image Augmentation
Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
☆ FS-RWKV: Leveraging Frequency Spatial-Aware RWKV for 3T-to-7T MRI Translation
Ultra-high-field 7T MRI offers enhanced spatial resolution and tissue contrast that enables the detection of subtle pathological changes in neurological disorders. However, the limited availability of 7T scanners restricts widespread clinical adoption due to substantial infrastructure costs and technical demands. Computational approaches for synthesizing 7T-quality images from accessible 3T acquisitions present a viable solution to this accessibility challenge. Existing CNN approaches suffer from limited spatial coverage, while Transformer models demand excessive computational overhead. RWKV architectures offer an efficient alternative for global feature modeling in medical image synthesis, combining linear computational complexity with strong long-range dependency capture. Building on this foundation, we propose Frequency Spatial-RWKV (FS-RWKV), an RWKV-based framework for 3T-to-7T MRI translation. To better address the challenges of anatomical detail preservation and global tissue contrast recovery, FS-RWKV incorporates two key modules: (1) Frequency-Spatial Omnidirectional Shift (FSO-Shift), which performs discrete wavelet decomposition followed by omnidirectional spatial shifting on the low-frequency branch to enhance global contextual representation while preserving high-frequency anatomical details; and (2) Structural Fidelity Enhancement Block (SFEB), a module that adaptively reinforces anatomical structure through frequency-aware feature fusion. Comprehensive experiments on UNC and BNU datasets demonstrate that FS-RWKV consistently outperforms existing CNN-, Transformer-, GAN-, and RWKV-based baselines across both T1w and T2w modalities, achieving superior anatomical fidelity and perceptual quality.
comment: Accepted by BIBM 2025
☆ Progressive Uncertainty-Guided Evidential U-KAN for Trustworthy Medical Image Segmentation
Trustworthy medical image segmentation aims at deliver accurate and reliable results for clinical decision-making. Most existing methods adopt the evidence deep learning (EDL) paradigm due to its computational efficiency and theoretical robustness. However, the EDL-based methods often neglect leveraging uncertainty maps rich in attention cues to refine ambiguous boundary segmentation. To address this, we propose a progressive evidence uncertainty guided attention (PEUA) mechanism to guide the model to focus on the feature representation learning of hard regions. Unlike conventional approaches, PEUA progressively refines attention using uncertainty maps while employing low-rank learning to denoise attention weights, enhancing feature learning for challenging regions. Concurrently, standard EDL methods suppress evidence of incorrect class indiscriminately via Kullback-Leibler (KL) regularization, impairing the uncertainty assessment in ambiguous areas and consequently distorts the corresponding attention guidance. We thus introduce a semantic-preserving evidence learning (SAEL) strategy, integrating a semantic-smooth evidence generator and a fidelity-enhancing regularization term to retain critical semantics. Finally, by embedding PEUA and SAEL with the state-of-the-art U-KAN, we proposes Evidential U-KAN, a novel solution for trustworthy medical image segmentation. Extensive experiments on 4 datasets demonstrate superior accuracy and reliability over the competing methods. The code is available at \href{https://anonymous.4open.science/r/Evidence-U-KAN-BBE8}{github}.
☆ Bi-level Meta-Policy Control for Dynamic Uncertainty Calibration in Evidential Deep Learning
Traditional Evidence Deep Learning (EDL) methods rely on static hyperparameter for uncertainty calibration, limiting their adaptability in dynamic data distributions, which results in poor calibration and generalization in high-risk decision-making tasks. To address this limitation, we propose the Meta-Policy Controller (MPC), a dynamic meta-learning framework that adjusts the KL divergence coefficient and Dirichlet prior strengths for optimal uncertainty modeling. Specifically, MPC employs a bi-level optimization approach: in the inner loop, model parameters are updated through a dynamically configured loss function that adapts to the current training state; in the outer loop, a policy network optimizes the KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing prediction accuracy and uncertainty quality. Unlike previous methods with fixed priors, our learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics. Extensive experimental results show that MPC significantly enhances the reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.
☆ RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
☆ Defense against Unauthorized Distillation in Image Restoration via Feature Space Perturbation
Knowledge distillation (KD) attacks pose a significant threat to deep model intellectual property by enabling adversaries to train student networks using a teacher model's outputs. While recent defenses in image classification have successfully disrupted KD by perturbing output probabilities, extending these methods to image restoration is difficult. Unlike classification, restoration is a generative task with continuous, high-dimensional outputs that depend on spatial coherence and fine details. Minor perturbations are often insufficient, as students can still learn the underlying mapping.To address this, we propose Adaptive Singular Value Perturbation (ASVP), a runtime defense tailored for image restoration models. ASVP operates on internal feature maps of the teacher using singular value decomposition (SVD). It amplifies the topk singular values to inject structured, high-frequency perturbations, disrupting the alignment needed for distillation. This hinders student learning while preserving the teacher's output quality.We evaluate ASVP across five image restoration tasks: super-resolution, low-light enhancement, underwater enhancement, dehazing, and deraining. Experiments show ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75%, with negligible impact on the teacher's performance. Compared to prior methods, ASVP offers a stronger and more consistent defense.Our approach provides a practical solution to protect open-source restoration models from unauthorized knowledge distillation.
☆ SegTrans: Transferable Adversarial Examples for Segmentation Models
Segmentation models exhibit significant vulnerability to adversarial examples in white-box settings, but existing adversarial attack methods often show poor transferability across different segmentation models. While some researchers have explored transfer-based adversarial attack (i.e., transfer attack) methods for segmentation models, the complex contextual dependencies within these models and the feature distribution gaps between surrogate and target models result in unsatisfactory transfer success rates. To address these issues, we propose SegTrans, a novel transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples. These enhanced samples replace the original ones for perturbation optimization, thereby improving the transferability of adversarial examples across different segmentation models. Unlike existing methods, SegTrans only retains local semantic information from the original input, rather than using global semantic information to optimize perturbations. Extensive experiments on two benchmark datasets, PASCAL VOC and Cityscapes, four different segmentation models, and three backbone networks show that SegTrans significantly improves adversarial transfer success rates without introducing additional computational overhead. Compared to the current state-of-the-art methods, SegTrans achieves an average increase of 8.55% in transfer attack success rate and improves computational efficiency by more than 100%.
comment: Accepted by TMM 2025
☆ PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
comment: 23 pages
☆ Modeling Time-Lapse Trajectories to Characterize Cranberry Growth ICCV
Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github. com/ronan-39/tlt/).
comment: Accepted to ICCV Workshops 2025
♻ ☆ Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of 360 scenes from a sparse set of 2D images. Pose-free scene reconstruction from incomplete, pose-free observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large complex scenes (with high degree of foreground and background detail) with known camera poses using view-conditioned generative priors, these methods cannot be directly adapted for the pose-free setting when ground-truth poses are not available during evaluation. To address this, we propose an image-to-image generative model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We introduce context and geometry conditioning using Feature-wise Linear Modulation (FiLM) modulation layers as a lightweight alternative to cross-attention and also propose a novel confidence measure for 3D Gaussian splat representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent 3D representation. Evaluations on the MipNeRF360 and DL3DV-10K benchmark dataset demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed (precomputed camera parameters are given) reconstruction methods in complex 360 scenes. Our project page provides additional results, videos, and code.
comment: Accepted to TMLR 2025. Code and models released at https://gaussianscenes.github.io/
♻ ☆ ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition ICCV 2025
Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs) and employs a stochastic search mechanism to locate high-likelihood activity labels while minimizing exhaustive enumeration efficiently. We systematically evaluate ProbRes across multiple openness levels (L0-L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition.
comment: Accepted to ICCV 2025. 17 pages, 6 figures, 3 tables
♻ ☆ Understanding Ice Crystal Habit Diversity with Self-Supervised Learning NeurIPS 2025
Ice-containing clouds strongly impact climate, but they are hard to model due to ice crystal habit (i.e., shape) diversity. We use self-supervised learning (SSL) to learn latent representations of crystals from ice crystal imagery. By pre-training a vision transformer with many cloud particle images, we learn robust representations of crystal morphology, which can be used for various science-driven tasks. Our key contributions include (1) validating that our SSL approach can be used to learn meaningful representations, and (2) presenting a relevant application where we quantify ice crystal diversity with these latent representations. Our results demonstrate the power of SSL-driven representations to improve the characterization of ice crystals and subsequently constrain their role in Earth's climate system.
comment: Accepted to NeurIPS 2025 Workshop: Tackling Climate Change with Machine Learning
♻ ☆ Solving Inverse Problems with FLAIR
Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the data likelihood term is usually intractable; (ii) learned generative models cannot be directly conditioned on the distorted observations, leading to conflicting objectives between data likelihood and prior; and (iii) the reconstructions can deviate from the observed data. We present FLAIR, a novel, training-free variational framework that leverages flow-based generative models as prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to guide the prior towards regions which are more likely under the posterior. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity. Our code is available at https://inverseflair.github.io/.
♻ ☆ Continual Adapter Tuning with Semantic Shift Compensation for Class-Incremental Learning CVPR 2024
Class-incremental learning (CIL) aims to enable models to continuously learn new classes while overcoming catastrophic forgetting. The introduction of pre-trained models has brought new tuning paradigms to CIL. In this paper, we revisit different parameter-efficient tuning (PET) methods within the context of continual learning. We observe that adapter tuning demonstrates superiority over prompt-based methods, even without parameter expansion in each learning session. Motivated by this, we propose incrementally tuning the shared adapter without imposing parameter update constraints, enhancing the learning capacity of the backbone. Additionally, we employ feature sampling from stored prototypes to retrain a unified classifier, further improving its performance. We estimate the semantic shift of old prototypes without access to past samples and update stored prototypes session by session. Our proposed method eliminates model expansion and avoids retaining any image samples. It surpasses previous pre-trained model-based CIL methods and demonstrates remarkable continual learning capabilities. Experimental results on five CIL benchmarks validate the effectiveness of our approach, achieving state-of-the-art (SOTA) performance.
comment: Journal extension of SSIAT (CVPR 2024)
♻ ☆ Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation ICML 2025
Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
comment: ICML 2025
♻ ☆ Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which creates a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark. Scaling up to 662M parameters, our model reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method while achieving a 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
comment: Code: https://github.com/ali-vilab/alitok
♻ ☆ CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold. We further evaluate our approach on the CAESAR and ISL Pointing datasets.
♻ ☆ SMF: Template-free and Rig-free Animation Transfer using Kinetic Codes SIGGRAPH
Animation retargetting applies sparse motion description (e.g., keypoint sequences) to a character mesh to produce a semantically plausible and temporally coherent full-body mesh sequence. Existing approaches come with restrictions -- they require access to template-based shape priors or artist-designed deformation rigs, suffer from limited generalization to unseen motion and/or shapes, or exhibit motion jitter. We propose Self-supervised Motion Fields (SMF), a self-supervised framework that is trained with only sparse motion representations, without requiring dataset-specific annotations, templates, or rigs. At the heart of our method are Kinetic Codes, a novel autoencoder-based sparse motion encoding, that exposes a semantically rich latent space, simplifying large-scale training. Our architecture comprises dedicated spatial and temporal gradient predictors, which are jointly trained in an end-to-end fashion. The combined network, regularized by the Kinetic Codes' latent space, has good generalization across both unseen shapes and new motions. We evaluated our method on unseen motion sampled from AMASS, D4D, Mixamo, and raw monocular video for animation transfer on various characters with varying shapes and topology. We report a new SoTA on the AMASS dataset in the context of generalization to unseen motion. Code, weights, and supplementary are available on the project webpage at https://motionfields.github.io/
comment: SIGGRAPH Asia 2025 [ACM Transactions on Graphics] | Project website: https://motionfields.github.io/
♻ ☆ RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness NeurIPS 2025
Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.
comment: NeurIPS 2025 (Spotlight)
♻ ☆ The Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar
LiDAR's dense, sharp point cloud (PC) representations of the surrounding environment enable accurate perception and significantly improve road safety by offering greater scene awareness and understanding. However, LiDAR's high cost continues to restrict the broad adoption of high-level Autonomous Driving (AD) systems in commercially available vehicles. Prior research has shown progress towards circumventing the need for LiDAR by training a neural network, using LiDAR point clouds as ground truth (GT), to produce LiDAR-like 3D point clouds using only 4D Radars. One of the best examples is a neural network created to train a more efficient radar target detector with a modular 2D convolutional neural network (CNN) backbone and a temporal coherence network at its core that uses the RaDelft dataset for training (see arXiv:2406.04723). In this work, we investigate the impact of higher-capacity segmentation backbones on the quality of the produced point clouds. Our results show that while very high-capacity models may actually hurt performance, an optimal segmentation backbone can provide a 23.7% improvement over the state-of-the-art (SOTA).
♻ ☆ Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two potential training-free mitigation strategies, revealing potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://github.com/William030422/Video-Sycophancy.
comment: 28 pages
♻ ☆ Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning NeurIPS 2025
Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have explored training-free methods using first- and second-order statistics to aggregate local client data distributions at the server and achieve high performance without any training. In this work, we propose a training-free method based on an unbiased estimator of class covariance matrices which only uses first-order statistics in the form of class means communicated by clients to the server. We show how these estimated class covariances can be used to initialize the global classifier, thus exploiting the covariances without actually sharing them. We also show that using only within-class covariances results in a better classifier initialization. Our approach improves performance in the range of 4-26% with exactly the same communication cost when compared to methods sharing only class means and achieves performance competitive or superior to methods sharing second-order statistics with dramatically less communication overhead. The proposed method is much more communication-efficient than federated prompt-tuning methods and still outperforms them. Finally, using our method to initialize classifiers and then performing federated fine-tuning or linear probing again yields better performance. Code is available at https://github.com/dipamgoswami/FedCOF.
comment: Accepted at NeurIPS 2025
♻ ☆ The 1st Solution for CARE Liver Task Challenge 2025: Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation
Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. Meanwhile, these methods are originally used to deal with cross-modality scenarios, and often introduce structural distortions and suffer from unstable training, which may pose drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.
comment: 11 pages, 3 figures
♻ ☆ Differentially Private 2D Human Pose Estimation
Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first comprehensive framework for differentially private 2D human pose estimation (2D-HPE) by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Next, we incorporate Feature Differential Privacy(FDP) to selectively privatize only sensitive features while retaining public visual cues. Finally, we propose a hybrid feature-projective DP framework that combines both approaches to balance privacy and accuracy for HPE. We evaluate our approach on the MPII dataset across varying privacy budgets, training strategies, and clipping norms. Our combined feature-projective method consistently outperforms vanilla DP-SGD and individual baselines, achieving up to 82.61\% mean PCKh@0.5 at $\epsilon = 0.8$, substantially closing the gap to the non-private performance. This work lays foundation for privacy-preserving human pose estimation in real-world, sensitive applications.
♻ ☆ FFT-based Selection and Optimization of Statistics for Robust Recognition of Severely Corrupted Images
Improving model robustness in case of corrupted images is among the key challenges to enable robust vision systems on smart devices, such as robotic agents. Particularly, robust test-time performance is imperative for most of the applications. This paper presents a novel approach to improve robustness of any classification model, especially on severely corrupted images. Our method (FROST) employs high-frequency features to detect input image corruption type, and select layer-wise feature normalization statistics. FROST provides the state-of-the-art results for different models and datasets, outperforming competitors on ImageNet-C by up to 37.1% relative gain, improving baseline of 40.9% mCE on severe corruptions.
comment: ICASSP 2024. Copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other
♻ ☆ A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available at Github.
♻ ☆ What's Inside Your Diffusion Model? A Score-Based Riemannian Metric to Explore the Data Manifold
Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold's contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
♻ ☆ HoliTom: Holistic Token Merging for Fast Video Large Language Models
Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not leverage video compressibility fully. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLMs inference.
comment: code link: https://github.com/cokeshao/HoliTom
♻ ☆ MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
♻ ☆ MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging
Simulating in silico cellular responses to interventions is a promising direction to accelerate high-content image-based assays, critical for advancing drug discovery and gene editing. To support this, we introduce MorphGen, a state-of-the-art diffusion-based generative model for fluorescent microscopy that enables controllable generation across multiple cell types and perturbations. To capture biologically meaningful patterns consistent with known cellular morphologies, MorphGen is trained with an alignment loss to match its representations to the phenotypic embeddings of OpenPhenom, a state-of-the-art biological foundation model. Unlike prior approaches that compress multichannel stains into RGB images -- thus sacrificing organelle-specific detail -- MorphGen generates the complete set of fluorescent channels jointly, preserving per-organelle structures and enabling a fine-grained morphological analysis that is essential for biological interpretation. We demonstrate biological consistency with real images via CellProfiler features, and MorphGen attains an FID score over 35% lower than the prior state-of-the-art MorphoDiff, which only generates RGB images for a single cell type. Code is available at https://github.com/czi-ai/MorphGen.
♻ ☆ Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce \textbf{Mem4Nav}, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
comment: The paper is currently under investigation regarding concerns of potential academic misconduct. While the investigation is ongoing, the authors have voluntarily requested to withdraw the manuscript
♻ ☆ AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of \textbf{implicit rewards}, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce \textbf{Adaptive Meta Fine-Tuning (AMFT)}, a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a \textbf{meta-gradient adaptive weight controller} that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrats superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.
comment: The paper is currently under investigation regarding concerns of potential academic misconduct. While the investigation is ongoing, the authors have voluntarily requested to withdraw the manuscript
♻ ☆ TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
Identifying key moments in long videos is essential for downstream understanding and reasoning tasks. In this paper, we introduce a new problem, Taskoriented Temporal Grounding ToTG, which aims to localize time intervals containing the necessary information based on a task's natural description. Along with the definition, we also present ToTG Bench, a comprehensive benchmark for evaluating the performance on ToTG. ToTG is particularly challenging for traditional approaches due to their limited generalizability and difficulty in handling long videos. To address these challenges, we propose TimeScope, a novel framework built upon progressive reasoning. TimeScope first identifies a coarse-grained temporal scope in the long video that likely contains the key moments, and then refines this scope through finegrained moment partitioning. Additionally, we curate a highquality dataset, namely ToTG Pile, to enhance TimeScope's ability to perform progressive temporal grounding effectively. Extensive experiments demonstrate that TimeScope consistently outperforms both existing temporalgrounding methods and popular MLLMs across various settings, highlighting its effectiveness in addressing this new challenging problem.
♻ ☆ A Multimodal Depth-Aware Method For Embodied Reference Understanding
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
♻ ☆ MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis
Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at https://github.com/HiLab-git/MedCAL-Bench.
comment: 23 pages, 6 figures, 10 tables
♻ ☆ Benchmarking and Mitigate Sycophancy in Medical Vision-Language Models
Vision language models(VLMs) are increasingly integrated into clinical workflows, but they often exhibit sycophantic behavior prioritizing alignment with user phrasing social cues or perceived authority over evidence based reasoning. This study evaluate clinical sycophancy in medical visual question answering through a novel clinically grounded benchmark. We propose a medical sycophancy dataset construct from PathVQA, SLAKE, and VQA-RAD stratified by different type organ system and modality. Using psychologically motivated pressure templates including various sycophancy. In our adversarial experiments on various VLMs, we found that these models are generally vulnerable, exhibiting significant variations in the occurrence of adversarial responses, with weak correlations to the model accuracy or size. Imitation and expert provided corrections were found to be the most effective triggers, suggesting that the models possess a bias mechanism independent of visual evidence. To address this, we propose Visual Information Purification for Evidence based Response (VIPER) a lightweight mitigation strategy that filters non evidentiary content for example social pressures and then generates constrained evidence first answers. This framework reduces sycophancy by an average amount outperforming baselines while maintaining interpretability. Our benchmark analysis and mitigation framework lay the groundwork for robust deployment of medical VLMs in real world clinician interactions emphasizing the need for evidence anchored defenses.
comment: 19figures, 37pages
♻ ☆ Multimodal Language Models See Better When They Look Shallower
Multimodal large language models (MLLMs) typically extract visual features from the final layers of a pretrained Vision Transformer (ViT). This widespread deep-layer bias, however, is largely driven by empirical convention rather than principled analysis. While prior studies suggest that different ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, the impact of this variation on MLLM performance remains underexplored. We present the first comprehensive study of visual layer selection for MLLMs, analyzing representation similarity across ViT layers to establish shallow, middle, and deep layer groupings. Through extensive evaluation of MLLMs (1.4B-7B parameters) across 10 benchmarks encompassing 60+ tasks, we find that while deep layers excel in semantic-rich tasks like OCR, shallow and middle layers significantly outperform them on fine-grained visual tasks including counting, positioning, and object localization. Building on these insights, we propose a lightweight feature fusion method that strategically incorporates shallower layers, achieving consistent improvements over both single-layer and specialized fusion baselines. Our work offers the first principled study of visual layer selection in MLLMs, showing that MLLMs can often see better when they look shallower.
comment: 9 pages, 6 figures, accepted by EMNLP2025
♻ ☆ EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision MICCAI 2025
Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt -- an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.
comment: Won Third Place Award at Challenge 5 at BraTS-Lighthouse 2025 Challenge (MICCAI 2025)
♻ ☆ Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer
Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves State-of-the-Art performance on both the NYUv2 and SUN-RGBD datasets in general and especially in the most challenging of their image data. Our project page will be available at https://diffusionmms.github.io/
♻ ☆ Learning Neural Exposure Fields for View Synthesis NeurIPS 2025
Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. In the core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need of post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks improving by over 55% over best-performing baselines.
comment: Accepted to NeurIPS 2025. Project page available at https://m-niemeyer.github.io/nexf/index.html
♻ ☆ Contour Errors: An Ego-Centric Metric for Reliable 3D Multi-Object Tracking
Finding reliable matches is essential in multi-object tracking to ensure the accuracy and reliability of perception systems in safety-critical applications such as autonomous vehicles. Effective matching mitigates perception errors, enhancing object identification and tracking for improved performance and safety. However, traditional metrics such as Intersection over Union (IoU) and Center Point Distances (CPDs), which are effective in 2D image planes, often fail to find critical matches in complex 3D scenes. To address this limitation, we introduce Contour Errors (CEs), an ego or object-centric metric for identifying matches of interest in tracking scenarios from a functional perspective. By comparing bounding boxes in the ego vehicle's frame, contour errors provide a more functionally relevant assessment of object matches. Extensive experiments on the nuScenes dataset demonstrate that contour errors improve the reliability of matches over the state-of-the-art 2D IoU and CPD metrics in tracking-by-detection methods. In 3D car tracking, our results show that Contour Errors reduce functional failures (FPs/FNs) by 80% at close ranges and 60% at far ranges compared to IoU in the evaluation stage.
♻ ☆ TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis ICCV 2025
This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues that serve as sharp event-driven markers of audio-relevant visual moments to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving relatively 53% lower Frechet Distance (FD), 29% lower Frechet Audio Distance (FAD), and a 97.19% Alignment Accuracy, highlighting its superior audio quality and synchronization precision.
comment: Accepted to ICCV 2025. Please visit our project page at https://triton99.github.io/taro-site/
♻ ☆ Efficient and Accurate Pneumonia Detection Using a Novel Multi-Scale Transformer Approach
Pneumonia, a prevalent respiratory infection, remains a leading cause of morbidity and mortality worldwide, particularly among vulnerable populations. Chest X-rays serve as a primary tool for pneumonia detection; however, variations in imaging conditions and subtle visual indicators complicate consistent interpretation. Automated tools can enhance traditional methods by improving diagnostic reliability and supporting clinical decision-making. In this study, we propose a novel multi-scale transformer approach for pneumonia detection that integrates lung segmentation and classification into a unified framework. Our method introduces a lightweight transformer-enhanced TransUNet for precise lung segmentation, achieving a Dice score of 95.68% on the "Chest X-ray Masks and Labels" dataset with fewer parameters than traditional transformers. For classification, we employ pre-trained ResNet models (ResNet-50 and ResNet-101) to extract multi-scale feature maps, which are then processed through a modified transformer module to enhance pneumonia detection. This integration of multi-scale feature extraction and lightweight transformer modules ensures robust performance, making our method suitable for resource-constrained clinical environments. Our approach achieves 93.75% accuracy on the "Kermany" dataset and 96.04% accuracy on the "Cohen" dataset, outperforming existing methods while maintaining computational efficiency. This work demonstrates the potential of multi-scale transformer architectures to improve pneumonia diagnosis, offering a scalable and accurate solution to global healthcare challenges. https://github.com/amirrezafateh/Multi-Scale-Transformer-Pneumonia
♻ ☆ Robustness in Both Domains: CLIP Needs a Robust Text Encoder NeurIPS 2025
Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been done towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we cover this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization. We open-source our code ( https://github.com/LIONS-EPFL/LEAF ) and models ( https://huggingface.co/LEAF-CLIP ).
comment: Accepted in NeurIPS 2025
♻ ☆ PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image
We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework towards existing work. Project page at: https://panolam.github.io/.
♻ ☆ Brain2Text Decoding Model Reveals the Neural Mechanisms of Visual Semantic Processing
Decoding sensory experiences from neural activity to reconstruct human-perceived visual stimuli and semantic content remains a challenge in neuroscience and artificial intelligence. Despite notable progress in current brain decoding models, a critical gap still persists in their systematic integration with established neuroscientific theories and the exploration of underlying neural mechanisms. Here, we present a novel framework that directly decodes fMRI signals into textual descriptions of viewed natural images. Our novel deep learning model, trained without visual information, achieves state-of-the-art semantic decoding performance, generating meaningful captions that capture the core semantic content of complex scenes. Neuroanatomical analysis reveals the critical role of higher-level visual cortices, including MT+ complex, ventral stream visual cortex, and inferior parietal cortex, in visual semantic processing. Furthermore, category-specific analysis demonstrates nuanced neural representations for semantic dimensions like animacy and motion. This work provides a more direct and interpretable framework to the brain's semantic decoding, offering a powerful new methodology for probing the neural basis of complex semantic processing, refining the understanding of the distributed semantic network, and potentially developing brain-sinpired language models.
comment: 29 pages, 7 figures
♻ ☆ SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images ICCV2025
A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce \textbf{SpatialSplat}, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed Spatialsplat learns accurate semantic information and detailed instances prior with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60\% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code is available at https://github.com/shengyuuu/SpatialSplat.git
comment: ICCV2025
♻ ☆ GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Recently spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (eg. a map), thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
comment: 20 pages, 13 figures
♻ ☆ SQ-GAN: Semantic Image Communications Using Masked Vector Quantization
This work introduces Semantically Masked Vector Quantized Generative Adversarial Network (SQ-GAN), a novel approach integrating semantically driven image coding and vector quantization to optimize image compression for semantic/task-oriented communications. The method only acts on source coding and is fully compliant with legacy systems. The semantics is extracted from the image computing its semantic segmentation map using off-the-shelf software. A new specifically developed semantic-conditioned adaptive mask module (SAMM) selectively encodes semantically relevant features of the image. The relevance of the different semantic classes is task-specific, and it is incorporated in the training phase by introducing appropriate weights in the loss function. SQ-GAN outperforms state-of-the-art image compression schemes such as JPEG2000, BPG, and deep-learning based methods across multiple metrics, including perceptual quality and semantic segmentation accuracy on the reconstructed image, at extremely low compression rates.
comment: arXiv admin note: substantial text overlap with arXiv:2502.01675
♻ ☆ ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection
Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
comment: 4 pages, 1 reference, 3 figures
♻ ☆ Unified Multimodal Model as Auto-Encoder
The pursuit of unified multimodal models (UMMs) has long been hindered by a fundamental schism between multimodal understanding and generation. Current approaches typically disentangle the two and treat them as separate endeavors with disjoint objectives, missing the mutual benefits. We argue that true unification requires more than just merging two tasks. It requires a unified, foundational objective that intrinsically links them. In this paper, we introduce an insightful paradigm through the Auto-Encoder lens, i.e., regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. To implement this, we propose UAE, where we begin by pre-training the decoder with the proposed 700k long-context image-caption pairs to direct it to "understand" the fine-grained and complex semantics from the text. We then propose Unified-GRPO via reinforcement learning (RL) to unify the two, which covers two complementary stages: (1) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual perception; (2) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception like small object and color recognition (verified on MMT-Bench). This bidirectional improvement reveals a deep synergy: under the unified reconstruction objective, generation and understanding can mutually benefit each other, moving closer to truly unified multimodal intelligence.
♻ ☆ DiffMark: Diffusion-based Robust Watermark Against Deepfakes
Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack the sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of watermark with image during generation. In this study, we propose a novel robust watermarking framework based on diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate corresponding watermarked image. In the construction of facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity with the decrease of noise, thus better adapting to the sampling process of diffusion model. To achieve the fusion of watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at https://github.com/vpsg-research/DiffMark.
♻ ☆ RangeSAM: Leveraging Visual Foundation Models for Range-View repesented LiDAR segmentation
Point cloud segmentation is central to autonomous driving and 3D scene understanding. While voxel- and point-based methods dominate recent research due to their compatibility with deep architectures and ability to capture fine-grained geometry, they often incur high computational cost, irregular memory access, and limited real-time efficiency. In contrast, range-view methods, though relatively underexplored - can leverage mature 2D semantic segmentation techniques for fast and accurate predictions. Motivated by the rapid progress in Visual Foundation Models (VFMs) for captioning, zero-shot recognition, and multimodal tasks, we investigate whether SAM2, the current state-of-the-art VFM for segmentation tasks, can serve as a strong backbone for LiDAR point cloud segmentation in the range view. We present , to our knowledge, the first range-view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back-projection to operate on point clouds. To optimize SAM2 for range-view representations, we implement several architectural modifications to the encoder: (1) a novel module that emphasizes horizontal spatial dependencies inherent in LiDAR range images, (2) a customized configuration of tailored to the geometric properties of spherical projections, and (3) an adapted mechanism in the encoder backbone specifically designed to capture the unique spatial patterns and discontinuities present in range-view pseudo-images. Our approach achieves competitive performance on SemanticKITTI while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines. This work highlights the viability of VFMs as general-purpose backbones for 3D perception and opens a path toward unified, foundation-model-driven LiDAR segmentation. Results lets us conclude that range-view segmentation methods using VFMs leads to promising results.
♻ ☆ AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving
With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.
comment: We believe that the contribution of this paper is not enough, so we integrated it into another new paper. The arXiv ID of the new paper is arXiv:2510.01795
♻ ☆ DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images AAAI 2024
The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response to this, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced, stable diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images, thereby generating detailed hate attention maps, which are used to blur these regions from the image, thereby removing the hateful sections of the image. We release this data set as a part of the dehate shared task. This paper also describes the details of the shared task. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.
comment: Defactify 3 workshop at AAAI 2024
♻ ☆ RadVLM: A Multitask Conversational Vision-Language Model for Radiology
The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
comment: 21 pages, 15 figures
♻ ☆ G$^2$RPO: Granular GRPO for Precise Reward in Flow Models
The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G$^2$RPO significantly outperforms existing flow-based GRPO baselines,highlighting its effectiveness and robustness.
comment: Project Page: https://bujiazi.github.io/g2rpo.github.io/
♻ ☆ RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models NeurIPS 2025
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC
comment: Accepted to NeurIPS 2025
♻ ☆ LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2$\times$ larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: https://github.com/HJNVR/LaV-CoT
comment: 12 Pages, 12 Figures, 3 Tables
♻ ☆ RetouchLLM: Training-free Code-based Image Retouching with Vision Language Models
Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises of two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable codes. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.
♻ ☆ CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection NeurIPS 2025
With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness in COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The code is publicly at https://github.com/FireRedTeam/CQ-DINO.
comment: Accepted at NeurIPS 2025
♻ ☆ From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding
Selecting informative keyframes is critical for efficient video understanding, yet existing approaches often rely on heuristics, ignore semantics, or produce redundant frames. We propose KeyScore, a caption-aware frame scoring method that combines three complementary signals: semantic similarity to captions, temporal representativeness, and contextual drop impact. Applied to large-scale video-caption datasets, KeyScore generates frame-level importance scores that enable training keyframe extractors or guiding video-language models. To support this, we also propose STACFP, a Spatio-Temporal Adaptive Clustering method that generates diverse and compact frame proposals across long videos. Together, KeyScore and STACFP reduce uninformative frames while preserving critical content, resulting in faster and more accurate inference. Our experiments on three standard video-language benchmarks (MSRVTT, MSVD, DiDeMo) show that combining STACFP and KeyScore enables up to 99% frame reduction compared to full-frame processing, while outperforming uniform 8-frame encoders in video-text retrieval, keyframe extraction, and action recognition tasks. By focusing on semantically relevant frames, our method enhances both efficiency and performance, enabling scalable and caption-grounded video understanding.
comment: 10 pages, 4 figures
♻ ☆ SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts
In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the objects position with the help of user guidance. In our case the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regards to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.
comment: 4 figures, 6 tables, 12 pages
♻ ☆ MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder
Missing input sequences are common in medical imaging data, posing a challenge for deep learning models reliant on complete input data. In this work, inspired by MultiMAE [2], we develop a masked autoencoder (MAE) paradigm for multi-modal, multi-task learning in 3D medical imaging with brain MRIs. Our method treats each MRI sequence as a separate input modality, leveraging a late-fusion-style transformer encoder to integrate multi-sequence information (multi-modal) and individual decoder streams for each modality for multi-task reconstruction. This pretraining strategy guides the model to learn rich representations per modality while also equipping it to handle missing inputs through cross-sequence reasoning. The result is a flexible and generalizable encoder for brain MRIs that infers missing sequences from available inputs and can be adapted to various downstream applications. We demonstrate the performance and robustness of our method against an MAE-ViT baseline in downstream segmentation and classification tasks, showing absolute improvement of $10.1$ overall Dice score and $0.46$ MCC over the baselines with missing input sequences. Our experiments demonstrate the strength of this pretraining strategy. The implementation is made available.
comment: Official implementation: https://github.com/chris-beischl/multimae-for-brain-mri
♻ ☆ Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects.
comment: project page: https://vivocameraresearch.github.io/any2bokeh/
♻ ☆ Visual Representation Alignment for Multimodal Large Language Models
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
comment: Project Page: https://cvlab-kaist.github.io/VIRAL/
♻ ☆ One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting NeurIPS 2025
Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
comment: 25 pages, 11 figures, to appear at NeurIPS 2025
♻ ☆ TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation NeurIPS 2025
Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. Besides, inspired by VAR's hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard for evaluation, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection on samples with higher potential. (ii) In fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
comment: Accepted to NeurIPS 2025
♻ ☆ Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
State-of-The-Art (SoTA) image captioning models are often trained on the MicroSoft Common Objects in Context (MS-COCO) dataset, which contains human-annotated captions with an average length of approximately ten tokens. Although effective for general scene understanding, these short captions often fail to capture complex scenes and convey detailed information. Moreover, captioning models tend to exhibit bias towards the ``average'' caption, which captures only the more general aspects, thus overlooking finer details. In this paper, we present a novel approach to generate richer and more informative image captions by combining the captions generated from different SoTA captioning models. Our proposed method requires no additional model training: given an image, it leverages pre-trained models from the literature to generate the initial captions, and then ranks them using a newly introduced image-text-based metric, which we name BLIPScore. Subsequently, the top two captions are fused using a Large Language Model (LLM) to produce the final, more detailed description. Experimental results on the MS-COCO and Flickr30k test sets demonstrate the effectiveness of our approach in terms of caption-image alignment and hallucination reduction according to the ALOHa, CAPTURE, and Polos metrics. A subjective study lends additional support to these results, suggesting that the captions produced by our model are generally perceived as more consistent with human judgment. By combining the strengths of diverse SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich and informative nature of human-generated descriptions. This advance enables the generation of more suitable captions for the training of both vision-language and captioning models.
comment: This manuscript has been accepted for publication in Springer Neural Computing and Applications
♻ ☆ Post-training quantization of vision encoders needs prefixing registers
Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
♻ ☆ Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention NeurIPS 2025
Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [\texttt{CLS}] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose {HoloV}, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA1.5 equipped with HoloV preserves 95.8\% of the original performance after pruning 88.9\% of visual tokens, achieving superior efficiency-accuracy trade-offs.
comment: Accepted by NeurIPS 2025 main
♻ ☆ TBStar-Edit: From Image Editing Pattern Shifting to Consistency Enhancement
Recent advances in image generation and editing technologies have enabled state-of-the-art models to achieve impressive results in general domains. However, when applied to e-commerce scenarios, these general models often encounter consistency limitations. To address this challenge, we introduce TBStar-Edit, an new image editing model tailored for the e-commerce domain. Through rigorous data engineering, model architecture design and training strategy, TBStar-Edit achieves precise and high-fidelity image editing while maintaining the integrity of product appearance and layout. Specifically, for data engineering, we establish a comprehensive data construction pipeline, encompassing data collection, construction, filtering, and augmentation, to acquire high-quality, instruction-following, and strongly consistent editing data to support model training. For model architecture design, we design a hierarchical model framework consisting of a base model, pattern shifting modules, and consistency enhancement modules. For model training, we adopt a two-stage training strategy to enhance the consistency preservation: first stage for editing pattern shifting, and second stage for consistency enhancement. Each stage involves training different modules with separate datasets. Finally, we conduct extensive evaluations of TBStar-Edit on a self-proposed e-commerce benchmark, and the results demonstrate that TBStar-Edit outperforms existing general-domain editing models in both objective metrics (VIE Score) and subjective user preference.
♻ ☆ Frequency-Aware Ensemble Learning for BraTS 2025 Pediatric Brain Tumor Segmentation
Pediatric brain tumor segmentation presents unique challenges due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning. We propose an ensemble approach integrating nnU-Net, Swin UNETR, and HFF-Net for the BraTS-PED 2025 challenge. Our method incorporates three key extensions: adjustable initialization scales for optimal nnU-Net complexity control, transfer learning from BraTS 2021 pre-trained models to enhance Swin UNETR's generalization on pediatric dataset, and frequency domain decomposition for HFF-Net to separate low-frequency tissue contours from high-frequency texture details. Our final ensemble framework combines nnU-Net ($\gamma=0.7$), fine-tuned Swin UNETR, and HFF-Net, achieving Dice scores of 62.7% (CC), 83.2% (ED), 72.9% (ET), 85.7% (NET), 91.8% (TC), and 92.6% (WT) on the unseen test dataset, respectively. Our proposed method achieves first place (rank 1st) in the BraTS 2025 Pediatric Brain Tumor Segmentation Challenge.
comment: 11 pages, 3 figures, conference, miccai brats challenge
♻ ☆ DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models NeurIPS 2025
Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
comment: NeurIPS 2025 Spotlight. Project page: https://snap-research.github.io/DenseDPO/
♻ ☆ Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting
Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, is a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further supports the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset has been released publicly.
♻ ☆ UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.
♻ ☆ ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving
The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state-of-the-art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes by a margin of 2.53% in mAP (50.95% vs. 48.42%).
♻ ☆ RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation ICCV 2025
Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and clothing semantic absence in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusing. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
comment: Accept by ICCV 2025 (Highlight). Project website: https://colorful-liyu.github.io/RAGDiffusion-page/
♻ ☆ VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones
Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisonTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in GIFT-Eval benchmark which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
comment: 19 pages
♻ ☆ UltraSeP: Sequence-aware Pre-training for Echocardiography Probe Movement Guidance
Echocardiography is an essential medical technique for diagnosing cardiovascular diseases, but its high operational complexity has led to a shortage of trained professionals. To address this issue, we introduce a novel probe movement guidance algorithm that has the potential to be applied in guiding robotic systems or novices with probe pose adjustment for high-quality standard plane image acquisition.Cardiac ultrasound faces two major challenges: (1) the inherently complex structure of the heart, and (2) significant individual variations. Previous works have only learned the population-averaged structure of the heart rather than personalized cardiac structures, leading to a performance bottleneck. Clinically, we observe that sonographers dynamically adjust their interpretation of a patient's cardiac anatomy based on prior scanning sequences, consequently refining their scanning strategies. Inspired by this, we propose a novel sequence-aware self-supervised pre-training method. Specifically, our approach learns personalized three-dimensional cardiac structural features by predicting the masked-out image features and probe movement actions in a scanning sequence. We hypothesize that if the model can predict the missing content it has acquired a good understanding of personalized cardiac structure. Extensive experiments on a large-scale expert scanning dataset with 1.67 million samples demonstrate that our proposed sequence-aware paradigm can effectively reduce probe guidance errors compared to other advanced baseline methods.
comment: Accepted by Pattern Recognition
♻ ☆ Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 $\pm$ 0.10, balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
comment: S.G.-F., M.B., and A.E. contributed equally to this work and share first authorship. M.Z. and P.F. contributed equally to this work and share senior authorship
♻ ☆ Enhancing Feature Fusion of U-like Networks with Dynamic Skip Connections
U-like networks have become fundamental frameworks in medical image segmentation through skip connections that bridge high-level semantics and low-level spatial details. Despite their success, conventional skip connections exhibit two key limitations: inter-feature constraints and intra-feature constraints. The inter-feature constraint refers to the static nature of feature fusion in traditional skip connections, where information is transmitted along fixed pathways regardless of feature content. The intra-feature constraint arises from the insufficient modeling of multi-scale feature interactions, thereby hindering the effective aggregation of global contextual information. To overcome these limitations, we propose a novel Dynamic Skip Connection (DSC) block that fundamentally enhances cross-layer connectivity through adaptive mechanisms. The DSC block integrates two complementary components. (1) Test-Time Training (TTT) module. This module addresses the inter-feature constraint by enabling dynamic adaptation of hidden representations during inference, facilitating content-aware feature refinement. (2) Dynamic Multi-Scale Kernel (DMSK) module. To mitigate the intra-feature constraint, this module adaptively selects kernel sizes based on global contextual cues, enhancing the network capacity for multi-scale feature integration. The DSC block is architecture-agnostic and can be seamlessly incorporated into existing U-like network structures. Extensive experiments demonstrate the plug-and-play effectiveness of the proposed DSC block across CNN-based, Transformer-based, hybrid CNN-Transformer, and Mamba-based U-like networks.
♻ ☆ The Role of Video Generation in Enhancing Data-Limited Action Understanding
Video action understanding tasks in real-world scenarios always suffer data limitations. In this paper, we address the data-limited action understanding problem by bridging data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We proposed the information enhancement strategy and the uncertainty-based label smoothing tailored to generate sample training. Through quantitative and qualitative analysis, we observed that real samples generally contain a richer level of information than generated samples. Based on this observation, the information enhancement strategy is proposed to enhance the informative content of the generated samples from two aspects: the environments and the characters. Furthermore, we observed that some low-quality generated samples might negatively affect model training. To address this, we devised the uncertainty-based label smoothing strategy to increase the smoothing of these samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
comment: IJCAI2025
♻ ☆ Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges
Video event detection has become a cornerstone of modern sports analytics, powering automated performance evaluation, content generation, and tactical decision-making. Recent advances in deep learning have driven progress in related tasks such as Temporal Action Localization (TAL), which detects extended action segments; Action Spotting (AS), which identifies a representative timestamp; and Precise Event Spotting (PES), which pinpoints the exact frame of an event. Although closely connected, their subtle differences often blur the boundaries between them, leading to confusion in both research and practical applications. Furthermore, prior surveys either address generic video event detection or broader sports video tasks, but largely overlook the unique temporal granularity and domain-specific challenges of event spotting. In addition, most existing sports video surveys focus on elite-level competitions while neglecting the wider community of everyday practitioners. This survey addresses these gaps by: (i) clearly delineating TAL, AS, and PES and their respective use cases; (ii) introducing a structured taxonomy of state of the art approaches including temporal modeling strategies, multimodal frameworks, and data-efficient pipelines tailored for AS and PES; and (iii) critically assessing benchmark datasets and evaluation protocols, highlighting limitations such as reliance on broadcast quality footage and metrics that over reward permissive multilabel predictions. By synthesizing current research and exposing open challenges, this work provides a comprehensive foundation for developing temporally precise, generalizable, and practically deployable sports event detection systems for both the research and industry communities.
comment: 28 pages
♻ ☆ SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks
This paper presents a theoretical framework explaining why fine tuning small, randomly selected subnetworks (slices) within pre trained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy their backbone representations retain rich, task relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter efficient fine tuning (PEFT) in large scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state of the art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.
Machine Learning 262
☆ Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. Then we finetune the various sizes of Qwen-2.5 models on P-TTS data. Across a suite of mathematical reasoning AIME2024 & 25, MATH500, and GPQA-Diamond, our P-TTS-7B and 32B models outperform the prior competitive baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on out-of-domain reasoning benchmarks of Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
comment: Our code and data are available at https://github.com/VILA-Lab/PTTS
☆ BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
Today's generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two problems arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.
☆ LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestant performance, who usually place above 90th. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results will be made publicly available on our website.
☆ MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts
Dynamical systems in the life sciences are often composed of complex mixtures of overlapping behavioral regimes. Cellular subpopulations may shift from cycling to equilibrium dynamics or branch towards different developmental fates. The transitions between these regimes can appear noisy and irregular, posing a serious challenge to traditional, flow-based modeling techniques which assume locally smooth dynamics. To address this challenge, we propose MODE (Mixture Of Dynamical Experts), a graphical modeling framework whose neural gating mechanism decomposes complex dynamics into sparse, interpretable components, enabling both the unsupervised discovery of behavioral regimes and accurate long-term forecasting across regime transitions. Crucially, because agents in our framework can jump to different governing laws, MODE is especially tailored to the aforementioned noisy transitions. We evaluate our method on a battery of synthetic and real datasets from computational biology. First, we systematically benchmark MODE on an unsupervised classification task using synthetic dynamical snapshot data, including in noisy, few-sample settings. Next, we show how MODE succeeds on challenging forecasting tasks which simulate key cycling and branching processes in cell biology. Finally, we deploy our method on human, single-cell RNA sequencing data and show that it can not only distinguish proliferation from differentiation dynamics but also predict when cells will commit to their ultimate fate, a key outstanding challenge in computational biology.
comment: 30 pages, 5 figures
☆ STaTS: Structure-Aware Temporal Sequence Summarization via Statistical Window Merging
Time series data often contain latent temporal structure, transitions between locally stationary regimes, repeated motifs, and bursts of variability, that are rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1 and ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90\% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
comment: 10 pages, 5 figures, 4 tables. Under Review
☆ Three Birds with One Stone: Improving Performance, Convergence, and System Throughput with Nest
Variational quantum algorithms (VQAs) have the potential to demonstrate quantum utility on near-term quantum computers. However, these algorithms often get executed on the highest-fidelity qubits and computers to achieve the best performance, causing low system throughput. Recent efforts have shown that VQAs can be run on low-fidelity qubits initially and high-fidelity qubits later on to still achieve good performance. We take this effort forward and show that carefully varying the qubit fidelity map of the VQA over its execution using our technique, Nest, does not just (1) improve performance (i.e., help achieve close to optimal results), but also (2) lead to faster convergence. We also use Nest to co-locate multiple VQAs concurrently on the same computer, thus (3) increasing the system throughput, and therefore, balancing and optimizing three conflicting metrics simultaneously.
comment: This paper will appear in the Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2026
☆ Automated Evolutionary Optimization for Resource-Efficient Neural Network Training
There are many critical challenges in optimizing neural network models, including distributed computing, compression techniques, and efficient training, regardless of their application to specific tasks. Solving such problems is crucial because the need for scalable and resource-efficient models is increasing. To address these challenges, we have developed a new automated machine learning (AutoML) framework, Parameter Efficient Training with Robust Automation (PETRA). It applies evolutionary optimization to model architecture and training strategy. PETRA includes pruning, quantization, and loss regularization. Experimental studies on real-world data with financial event sequences, as well as image and time-series -- benchmarks, demonstrate PETRA's ability to improve neural model performance and scalability -- namely, a significant decrease in model size (up to 75%) and latency (up to 33%), and an increase in throughput (by 13%) without noticeable degradation in the target metric.
☆ Conditional Flow Matching for Bayesian Posterior Inference
We propose a generative multivariate posterior sampler via flow matching. It offers a simple training objective, and does not require access to likelihood evaluation. The method learns a dynamic, block-triangular velocity field in the joint space of data and parameters, which results in a deterministic transport map from a source distribution to the desired posterior. The inverse map, named vector rank, is accessible by reversibly integrating the velocity over time. It is advantageous to leverage the dynamic design: proper constraints on the velocity yield a monotone map, which leads to a conditional Brenier map, enabling a fast and simultaneous generation of Bayesian credible sets whose contours correspond to level sets of Monge-Kantorovich data depth. Our approach is computationally lighter compared to GAN-based and diffusion-based counterparts, and is capable of capturing complex posterior structures. Finally, frequentist theoretical guarantee on the consistency of the recovered posterior distribution, and of the corresponding Bayesian credible sets, is provided.
☆ Interpretable Generative and Discriminative Learning for Multimodal and Incomplete Clinical Data
Real-world clinical problems are often characterized by multimodal data, usually associated with incomplete views and limited sample sizes in their cohorts, posing significant limitations for machine learning algorithms. In this work, we propose a Bayesian approach designed to efficiently handle these challenges while providing interpretable solutions. Our approach integrates (1) a generative formulation to capture cross-view relationships with a semi-supervised strategy, and (2) a discriminative task-oriented formulation to identify relevant information for specific downstream objectives. This dual generative-discriminative formulation offers both general understanding and task-specific insights; thus, it provides an automatic imputation of the missing views while enabling robust inference across different data sources. The potential of this approach becomes evident when applied to the multimodal clinical data, where our algorithm is able to capture and disentangle the complex interactions among biological, psychological, and sociodemographic modalities.
☆ Geo-Aware Models for Stream Temperature Prediction across Different Spatial Regions and Scales
Understanding environmental ecosystems is vital for the sustainable management of our planet. However,existing physics-based and data-driven models often fail to generalize to varying spatial regions and scales due to the inherent data heterogeneity presented in real environmental ecosystems. This generalization issue is further exacerbated by the limited observation samples available for model training. To address these issues, we propose Geo-STARS, a geo-aware spatio-temporal modeling framework for predicting stream water temperature across different watersheds and spatial scales. The major innovation of Geo-STARS is the introduction of geo-aware embedding, which leverages geographic information to explicitly capture shared principles and patterns across spatial regions and scales. We further integrate the geo-aware embedding into a gated spatio-temporal graph neural network. This design enables the model to learn complex spatial and temporal patterns guided by geographic and hydrological context, even with sparse or no observational data. We evaluate Geo-STARS's efficacy in predicting stream water temperature, which is a master factor for water quality. Using real-world datasets spanning 37 years across multiple watersheds along the eastern coast of the United States, Geo-STARS demonstrates its superior generalization performance across both regions and scales, outperforming state-of-the-art baselines. These results highlight the promise of Geo-STARS for scalable, data-efficient environmental monitoring and decision-making.
☆ A methodology for clinically driven interactive segmentation evaluation
Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
comment: 10 pages, Medical Image Computing and Computed Assisted Intervention 2025
☆ Unsupervised full-field Bayesian inference of orthotropic hyperelasticity from a single biaxial test: a myocardial case study
Fully capturing this behavior in traditional homogenized tissue testing requires the excitation of multiple deformation modes, i.e. combined triaxial shear tests and biaxial stretch tests. Inherently, such multimodal experimental protocols necessitate multiple tissue samples and extensive sample manipulations. Intrinsic inter-sample variability and manipulation-induced tissue damage might have an adverse effect on the inversely identified tissue behavior. In this work, we aim to overcome this gap by focusing our attention to the use of heterogeneous deformation profiles in a parameter estimation problem. More specifically, we adapt EUCLID, an unsupervised method for the automated discovery of constitutive models, towards the purpose of parameter identification for highly nonlinear, orthotropic constitutive models using a Bayesian inference approach and three-dimensional continuum elements. We showcase its strength to quantitatively infer, with varying noise levels, the material model parameters of synthetic myocardial tissue slabs from a single heterogeneous biaxial stretch test. This method shows good agreement with the ground-truth simulations and with corresponding credibility intervals. Our work highlights the potential for characterizing highly nonlinear and orthotropic material models from a single biaxial stretch test with uncertainty quantification.
☆ Performance Analysis of Machine Learning Algorithms in Chronic Kidney Disease Prediction
Kidneys are the filter of the human body. About 10% of the global population is thought to be affected by Chronic Kidney Disease (CKD), which causes kidney function to decline. To protect in danger patients from additional kidney damage, effective risk evaluation of CKD and appropriate CKD monitoring are crucial. Due to quick and precise detection capabilities, Machine Learning models can help practitioners accomplish this goal efficiently; therefore, an enormous number of diagnosis systems and processes in the healthcare sector nowadays are relying on machine learning due to its disease prediction capability. In this study, we designed and suggested disease predictive computer-aided designs for the diagnosis of CKD. The dataset for CKD is attained from the repository of machine learning of UCL, with a few missing values; those are filled in using "mean-mode" and "Random sampling method" strategies. After successfully achieving the missing data, eight ML techniques (Random Forest, SVM, Naive Bayes, Logistic Regression, KNN, XGBoost, Decision Tree, and AdaBoost) were used to establish models, and the performance evaluation comparisons among the result accuracies are measured by the techniques to find the machine learning models with the highest accuracy. Among them, Random Forest as well as Logistic Regression showed an outstanding 99% accuracy, followed by the Ada Boost, XGBoost, Naive Bayes, Decision Tree, and SVM, whereas the KNN classifier model stands last with an accuracy of 73%.
comment: 11 pages, 7 figures, Presented at the 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0417-0423
☆ Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $\epsilon$ and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.
comment: 48 pages, 3 figures, 4 tables
Locally Optimal Private Sampling: Beyond the Global Minimax NeurIPS 2025
We study the problem of sampling from a distribution under local differential privacy (LDP). Given a private distribution $P \in \mathcal{P}$, the goal is to generate a single sample from a distribution that remains close to $P$ in $f$-divergence while satisfying the constraints of LDP. This task captures the fundamental challenge of producing realistic-looking data under strong privacy guarantees. While prior work by Park et al. (NeurIPS'24) focuses on global minimax-optimality across a class of distributions, we take a local perspective. Specifically, we examine the minimax risk in a neighborhood around a fixed distribution $P_0$, and characterize its exact value, which depends on both $P_0$ and the privacy level. Our main result shows that the local minimax risk is determined by the global minimax risk when the distribution class $\mathcal{P}$ is restricted to a neighborhood around $P_0$. To establish this, we (1) extend previous work from pure LDP to the more general functional LDP framework, and (2) prove that the globally optimal functional LDP sampler yields the optimal local sampler when constrained to distributions near $P_0$. Building on this, we also derive a simple closed-form expression for the locally minimax-optimal samplers which does not depend on the choice of $f$-divergence. We further argue that this local framework naturally models private sampling with public data, where the public data distribution is represented by $P_0$. In this setting, we empirically compare our locally optimal sampler to existing global methods, and demonstrate that it consistently outperforms global minimax samplers.
comment: 44 pages, 11 figures. Accepted at NeurIPS 2025
☆ CRPS-LAM: Regional ensemble weather forecasting from matching marginals
Machine learning for weather prediction increasingly relies on ensemble methods to provide probabilistic forecasts. Diffusion-based models have shown strong performance in Limited-Area Modeling (LAM) but remain computationally expensive at sampling time. Building on the success of global weather forecasting models trained based on Continuous Ranked Probability Score (CRPS), we introduce CRPS-LAM, a probabilistic LAM forecasting model trained with a CRPS-based objective. By sampling and injecting a single latent noise vector into the model, CRPS-LAM generates ensemble members in a single forward pass, achieving sampling speeds up to 39 times faster than a diffusion-based model. We evaluate the model on the MEPS regional dataset, where CRPS-LAM matches the low errors of diffusion models. By retaining also fine-scale forecast details, the method stands out as an effective approach for probabilistic regional weather forecasting
comment: Preprint
Efficient Autoregressive Inference for Transformer Probabilistic Models
Transformer-based models for amortized probabilistic inference, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many real-world applications, from signal interpolation to multi-column tabular predictions, require coherent joint distributions that capture dependencies between predictions. While purely autoregressive architectures efficiently generate such distributions, they sacrifice the flexible set-conditioning that makes these models powerful for meta-learning. Conversely, the standard approach to obtain joint distributions from set-based models requires expensive re-encoding of the entire augmented conditioning set at each autoregressive step. We introduce a causal autoregressive buffer that preserves the advantages of both paradigms. Our approach decouples context encoding from updating the conditioning set. The model processes the context once and caches it. A dynamic buffer then captures target dependencies: as targets are incorporated, they enter the buffer and attend to both the cached context and previously buffered targets. This enables efficient batched autoregressive generation and one-pass joint log-likelihood evaluation. A unified training strategy allows seamless integration of set-based and autoregressive modes at minimal additional cost. Across synthetic functions, EEG signals, cognitive models, and tabular data, our method matches predictive accuracy of strong baselines while delivering up to 20 times faster joint sampling. Our approach combines the efficiency of autoregressive generative models with the representational power of set-based conditioning, making joint prediction practical for transformer-based probabilistic models.
☆ Few-shot multi-token DreamBooth with LoRa for style-consistent character generation
The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.
☆ D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
☆ Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic
Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often confounded together under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference, neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale and highlights the potential of hybrid models to effectively address key generalization barriers in neural reasoning systems.
☆ Geodesic Calculus on Latent Spaces
Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.
☆ Interpretable Machine Learning for Predicting Startup Funding, Patenting, and Exits
This study develops an interpretable machine learning framework to forecast startup outcomes, including funding, patenting, and exit. A firm-quarter panel for 2010-2023 is constructed from Crunchbase and matched to U.S. Patent and Trademark Office (USPTO) data. Three horizons are evaluated: next funding within 12 months, patent-stock growth within 24 months, and exit through an initial public offering (IPO) or acquisition within 36 months. Preprocessing is fit on a development window (2010-2019) and applied without change to later cohorts to avoid leakage. Class imbalance is addressed using inverse-prevalence weights and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). Logistic regression and tree ensembles, including Random Forest, XGBoost, LightGBM, and CatBoost, are compared using the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (AUROC). Patent, funding, and exit predictions achieve AUROC values of 0.921, 0.817, and 0.872, providing transparent and reproducible rankings for innovation finance.
☆ Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-$n$ attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.
☆ Failure Prediction at Runtime for Generative Robot Policies NeurIPS 2025
Imitation learning (IL) with generative models, such as diffusion and flow matching, has enabled robots to perform complex, long-horizon tasks. However, distribution shifts from unseen environments or compounding action errors can still cause unpredictable and unsafe behavior, leading to task failure. Early failure prediction during runtime is therefore essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a general framework for Failure Prediction at Runtime for generative IL policies that does not require failure data. FIPER identifies two key indicators of impending failure: (i) out-of-distribution (OOD) observations detected via random network distillation in the policy's embedding space, and (ii) high uncertainty in generated actions measured by a novel action-chunk entropy score. Both failure prediction scores are calibrated using a small set of successful rollouts via conformal prediction. A failure alarm is triggered when both indicators, aggregated over short time windows, exceed their thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data and videos are available at https://tum-lsy.github.io/fiper_website.
comment: Accepted to NeurIPS 2025
☆ SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests
Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.
comment: 8 pages, 5 figures
☆ On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification
Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.
☆ Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models
Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant and preserving non-redundant information from the query input by referencing another domain data which is input key and value. Beyond the prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and further argue that those two contrasting alignment mechanisms can co-exist in recommendation models We find that when the query input and output of cross-attention are orthogonal, model performance improves over 300 experiments. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that Orthogonal Alignment emerges naturally because it improves scaling law. We show that baselines additionally incorporating cross-attention module outperform parameter-matched baselines, achieving a superior accuracy-per-model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.
comment: 19 pages
☆ Bandits with Single-Peaked Preferences and Limited Resources
We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on \emph{single-peaked preferences} -- a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
☆ The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
☆ Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models
Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observe depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.
comment: 8 pages, 6 figures
☆ Active Model Selection for Large Language Models
We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
☆ What Do Temporal Graph Learning Models Learn?
Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.
☆ Estimating Brain Activity with High Spatial and Temporal Resolution using a Naturalistic MEG-fMRI Encoding Model
Current non-invasive neuroimaging techniques trade off between spatial resolution and temporal resolution. While magnetoencephalography (MEG) can capture rapid neural dynamics and functional magnetic resonance imaging (fMRI) can spatially localize brain activity, a unified picture that preserves both high resolutions remains an unsolved challenge with existing source localization or MEG-fMRI fusion methods, especially for single-trial naturalistic data. We collected whole-head MEG when subjects listened passively to more than seven hours of narrative stories, using the same stimuli in an open fMRI dataset (LeBel et al., 2023). We developed a transformer-based encoding model that combines the MEG and fMRI from these two naturalistic speech comprehension experiments to estimate latent cortical source responses with high spatiotemporal resolution. Our model is trained to predict MEG and fMRI from multiple subjects simultaneously, with a latent layer that represents our estimates of reconstructed cortical sources. Our model predicts MEG better than the common standard of single-modality encoding models, and it also yields source estimates with higher spatial and temporal fidelity than classic minimum-norm solutions in simulation experiments. We validated the estimated latent sources by showing its strong generalizability across unseen subjects and modalities. Estimated activity in our source space predict electrocorticography (ECoG) better than an ECoG-trained encoding model in an entirely new dataset. By integrating the power of large naturalistic experiments, MEG, fMRI, and encoding models, we propose a practical route towards millisecond-and-millimeter brain mapping.
☆ A Multimodal Approach to SME Credit Scoring Integrating Transaction and Ownership Networks
Small and Medium-sized Enterprises (SMEs) are known to play a vital role in economic growth, employment, and innovation. However, they tend to face significant challenges in accessing credit due to limited financial histories, collateral constraints, and exposure to macroeconomic shocks. These challenges make an accurate credit risk assessment by lenders crucial, particularly since SMEs frequently operate within interconnected firm networks through which default risk can propagate. This paper presents and tests a novel approach for modelling the risk of SME credit, using a unique large data set of SME loans provided by a prominent financial institution. Specifically, our approach employs Graph Neural Networks to predict SME default using multilayer network data derived from common ownership and financial transactions between firms. We show that combining this information with traditional structured data not only improves application scoring performance, but also explicitly models contagion risk between companies. Further analysis shows how the directionality and intensity of these connections influence financial risk contagion, offering a deeper understanding of the underlying processes. Our findings highlight the predictive power of network data, as well as the role of supply chain networks in exposing SMEs to correlated default risk.
☆ Cross-Receiver Generalization for RF Fingerprint Identification via Feature Disentanglement and Adversarial Training
Radio frequency fingerprint identification (RFFI) is a critical technique for wireless network security, leveraging intrinsic hardware-level imperfections introduced during device manufacturing to enable precise transmitter identification. While deep neural networks have shown remarkable capability in extracting discriminative features, their real-world deployment is hindered by receiver-induced variability. In practice, RF fingerprint signals comprise transmitter-specific features as well as channel distortions and receiver-induced biases. Although channel equalization can mitigate channel noise, receiver-induced feature shifts remain largely unaddressed, causing the RFFI models to overfit to receiver-specific patterns. This limitation is particularly problematic when training and evaluation share the same receiver, as replacing the receiver in deployment can cause substantial performance degradation. To tackle this challenge, we propose an RFFI framework robust to cross-receiver variability, integrating adversarial training and style transfer to explicitly disentangle transmitter and receiver features. By enforcing domain-invariant representation learning, our method isolates genuine hardware signatures from receiver artifacts, ensuring robustness against receiver changes. Extensive experiments on multi-receiver datasets demonstrate that our approach consistently outperforms state-of-the-art baselines, achieving up to a 10% improvement in average accuracy across diverse receiver settings.
☆ Design Principles for Sequence Models via Coefficient Dynamics
Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties. Thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.
☆ HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness
Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model's policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data efficiency.Code is available on Github.
☆ CHUCKLE -- When Humans Teach AI To Learn Emotions The Easy Way
Curriculum learning (CL) structures training from simple to complex samples, facilitating progressive learning. However, existing CL approaches for emotion recognition often rely on heuristic, data-driven, or model-based definitions of sample difficulty, neglecting the difficulty for human perception, a critical factor in subjective tasks like emotion recognition. We propose CHUCKLE (Crowdsourced Human Understanding Curriculum for Knowledge Led Emotion Recognition), a perception-driven CL framework that leverages annotator agreement and alignment in crowd-sourced datasets to define sample difficulty, under the assumption that clips challenging for humans are similarly hard for machine learning models. Empirical results suggest that CHUCKLE increases the relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while reducing the number of gradient updates, thereby enhancing both training efficiency and model robustness.
☆ Task-Level Insights from Eigenvalues across Sequence Models
Although softmax attention drives state-of-the-art performance for sequence models, its quadratic complexity limits scalability, motivating linear alternatives such as state space models (SSMs). While these alternatives improve efficiency, their fundamental differences in information processing remain poorly understood. In this work, we leverage the recently proposed dynamical systems framework to represent softmax, norm and linear attention as dynamical systems, enabling a structured comparison with SSMs by analyzing their respective eigenvalue spectra. Since eigenvalues capture essential aspects of dynamical system behavior, we conduct an extensive empirical analysis across diverse sequence models and benchmarks. We first show that eigenvalues influence essential aspects of memory and long-range dependency modeling, revealing spectral signatures that align with task requirements. Building on these insights, we then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.
☆ The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
☆ A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
☆ Deep Learning to Identify the Spatio-Temporal Cascading Effects of Train Delays in a High-Density Network
The operational efficiency of railway networks, a cornerstone of modern economies, is persistently undermined by the cascading effects of train delays. Accurately forecasting this delay propagation is a critical challenge for real-time traffic management. While recent research has leveraged Graph Neural Networks (GNNs) to model the network structure of railways, a significant gap remains in developing frameworks that provide multi-step autoregressive forecasts at a network-wide scale, while simultaneously offering the live, interpretable explanations needed for decision support. This paper addresses this gap by developing and evaluating a novel XGeoAI framework for live, explainable, multi-step train delay forecasting. The core of this work is a two-stage, autoregressive Graph Attention Network (GAT) model, trained on a real-world dataset covering over 40% of the Dutch railway network. The model represents the system as a spatio-temporal graph of operational events (arrivals and departures) and is enriched with granular features, including platform and station congestion. To test its viability for live deployment, the model is rigorously evaluated using a sequential, k-step-ahead forecasting protocol that simulates real-world conditions where prediction errors can compound. The results demonstrate that while the proposed GATv2 model is challenged on pure error metrics (MAE) by a simpler Persistence baseline, it achieves consistently higher precision in classifying delay events -- a crucial advantage for a reliable decision support tool.
comment: Accepted at SIGSPATIAL 2025 - GeoAI Workshop
Localist LLMs -- A Mathematical Framework for Dynamic Locality Control
We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
Efficient Bayesian Inference from Noisy Pairwise Comparisons
Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
☆ Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.
☆ Rate optimal learning of equilibria from data
We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2}),$ matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
☆ Residual-Informed Learning of Solutions to Algebraic Loops
This paper presents a residual-informed machine learning approach for replacing algebraic loops in equation-based Modelica models with neural network surrogates. A feedforward neural network is trained using the residual (error) of the algebraic loop directly in its loss function, eliminating the need for a supervised dataset. This training strategy also resolves the issue of ambiguous solutions, allowing the surrogate to converge to a consistent solution rather than averaging multiple valid ones. Applied to the large-scale IEEE 14-Bus system, our method achieves a 60% reduction in simulation time compared to conventional simulations, while maintaining the same level of accuracy through error control mechanisms.
comment: 16 pages, 16 figures, 5 tables, submitted to IDaS-Schriftenreihe from Hochschule Bielefeld - University of Applied Sciences and Arts (HSBI)
☆ Large Language Model Prompt Datasets: An In-depth Analysis and Insights
A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we--for the first time--have compiled an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories, distinguishing them from other text corpora like literature and web. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.
☆ Reliability Sensitivity with Response Gradient
Engineering risk is concerned with the likelihood of failure and the scenarios when it occurs. The sensitivity of failure probability to change in system parameters is relevant to risk-informed decision making. Computing sensitivity is at least one level more difficult than the probability itself, which is already challenged by a large number of input random variables, rare events and implicit nonlinear `black-box' response. Finite difference with Monte Carlo probability estimates is spurious, requiring the number of samples to grow with the reciprocal of step size to suppress estimation variance. Many existing works gain efficiency by exploiting a specific class of input variables, sensitivity parameters, or response in its exact or surrogate form. For general systems, this work presents a theory and associated Monte Carlo strategy for computing sensitivity using response values and gradients with respect to sensitivity parameters. It is shown that the sensitivity at a given response threshold can be expressed via the expectation of response gradient conditional on the threshold. Determining the expectation requires conditioning on the threshold that is a zero-probability event, but it can be resolved by the concept of kernel smoothing. The proposed method offers sensitivity estimates for all response thresholds generated in a single Monte Carlo run. It is investigated in a number of examples featuring sensitivity parameters of different nature. As response gradient becomes increasingly available, it is hoped that this work can provide the basis for embedding sensitivity calculations with reliability in the same Monte Carlo run.
comment: 45 pages, 8 figures. Submitted to Structural Safety (Elsevier) on 5 Oct 2025
☆ Verifying Chain-of-Thought Reasoning via Its Computational Graph
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
☆ A Model-Driven Engineering Approach to AI-Powered Healthcare Platforms
Artificial intelligence (AI) has the potential to transform healthcare by supporting more accurate diagnoses and personalized treatments. However, its adoption in practice remains constrained by fragmented data sources, strict privacy rules, and the technical complexity of building reliable clinical systems. To address these challenges, we introduce a model driven engineering (MDE) framework designed specifically for healthcare AI. The framework relies on formal metamodels, domain-specific languages (DSLs), and automated transformations to move from high level specifications to running software. At its core is the Medical Interoperability Language (MILA), a graphical DSL that enables clinicians and data scientists to define queries and machine learning pipelines using shared ontologies. When combined with a federated learning architecture, MILA allows institutions to collaborate without exchanging raw patient data, ensuring semantic consistency across sites while preserving privacy. We evaluate this approach in a multi center cancer immunotherapy study. The generated pipelines delivered strong predictive performance, with support vector machines achieving up to 98.5 percent and 98.3 percent accuracy in key tasks, while substantially reducing manual coding effort. These findings suggest that MDE principles metamodeling, semantic integration, and automated code generation can provide a practical path toward interoperable, reproducible, and trustworthy digital health platforms.
comment: Disclaimer: This manuscript is currently under review at * MDPI Informatics*
☆ Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers
Machine Learning models in finance are highly susceptible to model drift, where predictive performance declines as data distributions shift. This issue is especially acute in developing economies such as those in Central Asia and the Caucasus - including Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan - where frequent and unpredictable macroeconomics shocks destabilize financial data. To the best of our knowledge, this is among the first studies to examine drift mitigation methods on financial datasets from these regions. We investigate the use of synthetic outliers, a largely unexplored approach, to improve model stability against unforeseen shocks. To evaluate effectiveness, we introduce a two-level framework that measures both the extent of performance degradation and the severity of shocks. Our experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model
☆ A unified Bayesian framework for adversarial robustness
The vulnerability of machine learning models to adversarial attacks remains a critical security challenge. Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. However, these deterministic approaches do not account for uncertainty in the adversary's attack. While stochastic defenses placing a probability distribution on the adversary exist, they often lack statistical rigor and fail to make explicit their underlying assumptions. To resolve these issues, we introduce a formal Bayesian framework that models adversarial uncertainty through a stochastic channel, articulating all probabilistic assumptions. This yields two robustification strategies: a proactive defense enacted during training, aligned with adversarial training, and a reactive defense enacted during operations, aligned with adversarial purification. Several previous defenses can be recovered as limiting cases of our model. We empirically validate our methodology, showcasing the benefits of explicitly modeling adversarial uncertainty.
☆ Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects
Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in the presence of physical triggers. Specifically, based on a popular VLA benchmark LIBERO, we introduce BadLIBERO that incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA's actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal in 97 percentage of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at https://goba-attack.github.io/.
☆ Placeit! A Framework for Learning Robot Object Placement Skills
Robotics research has made significant strides in learning, yet mastering basic skills like object placement remains a fundamental challenge. A key bottleneck is the acquisition of large-scale, high-quality data, which is often a manual and laborious process. Inspired by Graspit!, a foundational work that used simulation to automatically generate dexterous grasp poses, we introduce Placeit!, an evolutionary-computation framework for generating valid placement positions for rigid objects. Placeit! is highly versatile, supporting tasks from placing objects on tables to stacking and inserting them. Our experiments show that by leveraging quality-diversity optimization, Placeit! significantly outperforms state-of-the-art methods across all scenarios for generating diverse valid poses. A pick&place pipeline built on our framework achieved a 90% success rate over 120 real-world deployments. This work positions Placeit! as a powerful tool for open-environment pick-and-place tasks and as a valuable engine for generating the data needed to train simulation-based foundation models in robotics.
comment: 8 pages, 8 figures. Draft version
☆ GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis
Recent work has shown that RLHF is highly susceptible to backdoor attacks, poisoning schemes that inject malicious triggers in preference data. However, existing methods often rely on static, rare-token-based triggers, limiting their effectiveness in realistic scenarios. In this paper, we develop GREAT, a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. Specifically, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. At the core of GREAT is a trigger identification pipeline that operates in the latent embedding space, leveraging principal component analysis and clustering techniques to identify the most representative triggers. To enable this, we present Erinyes, a high-quality dataset of over $5000$ angry triggers curated from GPT-4.1 using a principled, hierarchical, and diversity-promoting approach. Experiments on benchmark RLHF datasets demonstrate that GREAT significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving the response quality on benign inputs.
☆ Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
☆ Zero-shot image privacy classification with Vision-Language Models
While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
comment: 5 pages, 3 figures, 3 tables. This work has been submitted to the ICASSP 2026
☆ Smart navigation of a gravity-driven glider with adjustable centre-of-mass
Artificial gliders are designed to disperse as they settle through a fluid, requiring precise navigation to reach target locations. We show that a compact glider settling in a viscous fluid can navigate by dynamically adjusting its centre-of-mass. Using fully resolved direct numerical simulations (DNS) and reinforcement learning, we find two optimal navigation strategies that allow the glider to reach its target location accurately. These strategies depend sensitively on how the glider interacts with the surrounding fluid. The nature of this interaction changes as the particle Reynolds number Re$_p$ changes. Our results explain how the optimal strategy depends on Re$_p$. At large Re$_p$, the glider learns to tumble rapidly by moving its centre-of-mass as its orientation changes. This generates a large horizontal inertial lift force, which allows the glider to travel far. At small Re$_p$, by contrast, high viscosity hinders tumbling. In this case, the glider learns to adjust its centre-of-mass so that it settles with a steady, inclined orientation that results in a horizontal viscous force. The horizontal range is much smaller than for large Re$_p$, because this viscous force is much smaller than the inertial lift force at large Re$_p$. *These authors contributed equally.
comment: 13 pages, 8 figures
☆ Application of Deep Reinforcement Learning to At-the-Money S&P 500 Options Hedging
This paper explores the application of deep Q-learning to hedging at-the-money options on the S\&P~500 index. We develop an agent based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, trained to simulate hedging decisions without making explicit model assumptions on price dynamics. The agent was trained on historical intraday prices of S\&P~500 call options across years 2004--2024, using a single time series of six predictor variables: option price, underlying asset price, moneyness, time to maturity, realized volatility, and current hedge position. A walk-forward procedure was applied for training, which led to nearly 17~years of out-of-sample evaluation. The performance of the deep reinforcement learning (DRL) agent is benchmarked against the Black--Scholes delta-hedging strategy over the same period. We assess both approaches using metrics such as annualized return, volatility, information ratio, and Sharpe ratio. To test the models' adaptability, we performed simulations across varying market conditions and added constraints such as transaction costs and risk-awareness penalties. Our results show that the DRL agent can outperform traditional hedging methods, particularly in volatile or high-cost environments, highlighting its robustness and flexibility in practical trading contexts. While the agent consistently outperforms delta-hedging, its performance deteriorates when the risk-awareness parameter is higher. We also observed that the longer the time interval used for volatility estimation, the more stable the results.
comment: 35 pages
☆ A PCA-based Data Prediction Method
The problem of choosing appropriate values for missing data is often encountered in the data science. We describe a novel method containing both traditional mathematics and machine learning elements for prediction (imputation) of missing data. This method is based on the notion of distance between shifted linear subspaces representing the existing data and candidate sets. The existing data set is represented by the subspace spanned by its first principal components. Solutions for the case of the Euclidean metric are given.
☆ Investigating the Impact of Rational Dilated Wavelet Transform on Motor Imagery EEG Decoding with Deep Learning Models
The present study investigates the impact of the Rational Discrete Wavelet Transform (RDWT), used as a plug-in preprocessing step for motor imagery electroencephalographic (EEG) decoding prior to applying deep learning classifiers. A systematic paired evaluation (with/without RDWT) is conducted on four state-of-the-art deep learning architectures: EEGNet, ShallowConvNet, MBEEG\_SENet, and EEGTCNet. This evaluation was carried out across three benchmark datasets: High Gamma, BCI-IV-2a, and BCI-IV-2b. The performance of the RDWT is reported with subject-wise averages using accuracy and Cohen's kappa, complemented by subject-level analyses to identify when RDWT is beneficial. On BCI-IV-2a, RDWT yields clear average gains for EEGTCNet (+4.44 percentage points, pp; kappa +0.059) and MBEEG\_SENet (+2.23 pp; +0.030), with smaller improvements for EEGNet (+2.08 pp; +0.027) and ShallowConvNet (+0.58 pp; +0.008). On BCI-IV-2b, the enhancements observed are modest yet consistent for EEGNet (+0.21 pp; +0.044) and EEGTCNet (+0.28 pp; +0.077). On HGD, average effects are modest to positive, with the most significant gain observed for MBEEG\_SENet (+1.65 pp; +0.022), followed by EEGNet (+0.76 pp; +0.010) and EEGTCNet (+0.54 pp; +0.008). Inspection of the subject material reveals significant enhancements in challenging recordings (e.g., non-stationary sessions), indicating that RDWT can mitigate localized noise and enhance rhythm-specific information. In conclusion, RDWT is shown to be a low-overhead, architecture-aware preprocessing technique that can yield tangible gains in accuracy and agreement for deep model families and challenging subjects.
☆ Characterizing 5G User Throughput via Uncertainty Modeling and Crowdsourced Measurements
Characterizing application-layer user throughput in next-generation networks is increasingly challenging as the higher capacity of the 5G Radio Access Network (RAN) shifts connectivity bottlenecks towards deeper parts of the network. Traditional methods, such as drive tests and operator equipment counters, are costly, limited, or fail to capture end-to-end (E2E) Quality of Service (QoS) and its variability. In this work, we leverage large-scale crowdsourced measurements-including E2E, radio, contextual and network deployment features collected by the user equipment (UE)-to propose an uncertainty-aware and explainable approach for downlink user throughput estimation. We first validate prior 4G methods, improving R^2 by 8.7%, and then extend them to 5G NSA and 5G SA, providing the first benchmarks for 5G crowdsourced datasets. To address the variability of throughput, we apply NGBoost, a model that outputs both point estimates and calibrated confidence intervals, representing its first use in the field of computer communications. Finally, we use the proposed model to analyze the evolution from 4G to 5G SA, and show that throughput bottlenecks move from the RAN to transport and service layers, as seen by E2E metrics gaining importance over radio-related features.
comment: Submitted to IEEE ICC 2026
☆ Incentivizing Time-Aware Fairness in Data Sharing NeurIPS 2025
In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.
comment: Accepted to NeurIPS 2025
☆ Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras
Shoulder disorders, such as frozen shoulder (a.k.a., adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework divides the two tasks of action understanding and disease diagnosis, which are respectively completed by two MLLMs. In addition to traditional evaluation indicators, this work proposes a novel metric called Usability Index by the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6\% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.
☆ Prime Implicant Explanations for Reaction Feasibility Prediction ECML
Machine learning models that predict the feasibility of chemical reactions have become central to automated synthesis planning. Despite their predictive success, these models often lack transparency and interpretability. We introduce a novel formulation of prime implicant explanations--also known as minimally sufficient reasons--tailored to this domain, and propose an algorithm for computing such explanations in small-scale reaction prediction tasks. Preliminary experiments demonstrate that our notion of prime implicant explanations conservatively captures the ground truth explanations. That is, such explanations often contain redundant bonds and atoms but consistently capture the molecular attributes that are essential for predicting reaction feasibility.
comment: Presented at AIMLAI workshop at ECMLPKDD 2025
☆ FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning
Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy's behavior to stabilize policy learning. Due to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.
comment: 20 pages
☆ IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data
Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.
comment: ACL 2025
☆ Provable Watermarking for Data Poisoning Attacks NeurIPS 2025
In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying dataset ownership or safeguarding private data from unauthorized use. However, these developments have the potential to cause misunderstandings and conflicts, as data poisoning has traditionally been regarded as a security threat to machine learning systems. To address this issue, it is imperative for harmless poisoning generators to claim ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. In this paper, we propose the deployment of watermarking schemes as a solution to this challenge. We introduce two provable and practical watermarking approaches for data poisoning: {\em post-poisoning watermarking} and {\em poisoning-concurrent watermarking}. Our analyses demonstrate that when the watermarking length is $\Theta(\sqrt{d}/\epsilon_w)$ for post-poisoning watermarking, and falls within the range of $\Theta(1/\epsilon_w^2)$ to $O(\sqrt{d}/\epsilon_p)$ for poisoning-concurrent watermarking, the watermarked poisoning dataset provably ensures both watermarking detectability and poisoning utility, certifying the practicality of watermarking under data poisoning attacks. We validate our theoretical findings through experiments on several attacks, models, and datasets.
comment: 42 pages, NeurIPS 2025
☆ Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization
Centralized trajectory optimization in the joint space of multiple robots allows access to a larger feasible space that can result in smoother trajectories, especially while planning in tight spaces. Unfortunately, it is often computationally intractable beyond a very small swarm size. In this paper, we propose Flow-Opt, a learning-based approach towards improving the computational tractability of centralized multi-robot trajectory optimization. Specifically, we reduce the problem to first learning a generative model to sample different candidate trajectories and then using a learned Safety-Filter(SF) to ensure fast inference-time constraint satisfaction. We propose a flow-matching model with a diffusion transformer (DiT) augmented with permutation invariant robot position and map encoders as the generative model. We develop a custom solver for our SF and equip it with a neural network that predicts context-specific initialization. The initialization network is trained in a self-supervised manner, taking advantage of the differentiability of the SF solver. We advance the state-of-the-art in the following respects. First, we show that we can generate trajectories of tens of robots in cluttered environments in a few tens of milliseconds. This is several times faster than existing centralized optimization approaches. Moreover, our approach also generates smoother trajectories orders of magnitude faster than competing baselines based on diffusion models. Second, each component of our approach can be batched, allowing us to solve a few tens of problem instances in a fraction of a second. We believe this is a first such result; no existing approach provides such capabilities. Finally, our approach can generate a diverse set of trajectories between a given set of start and goal locations, which can capture different collision-avoidance behaviors.
☆ Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
☆ Augmented data and neural networks for robust epidemic forecasting: application to COVID-19 in Italy
In this work, we propose a data augmentation strategy aimed at improving the training phase of neural networks and, consequently, the accuracy of their predictions. Our approach relies on generating synthetic data through a suitable compartmental model combined with the incorporation of uncertainty. The available data are then used to calibrate the model, which is further integrated with deep learning techniques to produce additional synthetic data for training. The results show that neural networks trained on these augmented datasets exhibit significantly improved predictive performance. We focus in particular on two different neural network architectures: Physics-Informed Neural Networks (PINNs) and Nonlinear Autoregressive (NAR) models. The NAR approach proves especially effective for short-term forecasting, providing accurate quantitative estimates by directly learning the dynamics from data and avoiding the additional computational cost of embedding physical constraints into the training. In contrast, PINNs yield less accurate quantitative predictions but capture the qualitative long-term behavior of the system, making them more suitable for exploring broader dynamical trends. Numerical simulations of the second phase of the COVID-19 pandemic in the Lombardy region (Italy) validate the effectiveness of the proposed approach.
☆ On the Implicit Adversariality of Catastrophic Forgetting in Deep Continual Learning
Continual learning seeks the human-like ability to accumulate new skills in machine intelligence. Its central challenge is catastrophic forgetting, whose underlying cause has not been fully understood for deep networks. In this paper, we demystify catastrophic forgetting by revealing that the new-task training is implicitly an adversarial attack against the old-task knowledge. Specifically, the new-task gradients automatically and accurately align with the sharp directions of the old-task loss landscape, rapidly increasing the old-task loss. This adversarial alignment is intriguingly counter-intuitive because the sharp directions are too sparsely distributed to align with by chance. To understand it, we theoretically show that it arises from training's low-rank bias, which, through forward and backward propagation, confines the two directions into the same low-dimensional subspace, facilitating alignment. Gradient projection (GP) methods, a representative family of forgetting-mitigating methods, reduce adversarial alignment caused by forward propagation, but cannot address the alignment due to backward propagation. We propose backGP to address it, which reduces forgetting by 10.8% and improves accuracy by 12.7% on average over GP methods.
☆ RepDL: Bit-level Reproducible Deep Learning Training and Inference
Non-determinism and non-reproducibility present significant challenges in deep learning, leading to inconsistent results across runs and platforms. These issues stem from two origins: random number generation and floating-point computation. While randomness can be controlled through deterministic configurations, floating-point inconsistencies remain largely unresolved. To address this, we introduce RepDL, an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments. RepDL achieves this by enforcing correct rounding and order invariance in floating-point computation. The source code is available at https://github.com/microsoft/RepDL .
comment: Originally drafted in 2023
☆ Distributionally robust approximation property of neural networks
The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional $L^p$-setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.
☆ Beyond Pairwise Connections: Extracting High-Order Functional Brain Network Structures under Global Constraints NeurIPS
Functional brain network (FBN) modeling often relies on local pairwise interactions, whose limitation in capturing high-order dependencies is theoretically analyzed in this paper. Meanwhile, the computational burden and heuristic nature of current hypergraph modeling approaches hinder end-to-end learning of FBN structures directly from data distributions. To address this, we propose to extract high-order FBN structures under global constraints, and implement this as a Global Constraints oriented Multi-resolution (GCM) FBN structure learning framework. It incorporates 4 types of global constraint (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning FBN structures for 4 distinct levels (sample/subject/group/project) of modeling resolution. Experimental results demonstrate that GCM achieves up to a 30.6% improvement in relative accuracy and a 96.3% reduction in computational time across 5 datasets and 2 task settings, compared to 9 baselines and 10 state-of-the-art methods. Extensive experiments validate the contributions of individual components and highlight the interpretability of GCM. This work offers a novel perspective on FBN structure learning and provides a foundation for interdisciplinary applications in cognitive neuroscience. Code is publicly available on https://github.com/lzhan94swu/GCM.
comment: 33 pages, 10 figures, NeurIPS
☆ Robustness and Regularization in Hierarchical Re-Basin
This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.
comment: Published in 32th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2024
Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization
As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model's essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.
☆ Cross-Representation Benchmarking in Time-Series Electronic Health Records for Clinical Outcome Prediction
Electronic Health Records (EHRs) enable deep learning for clinical predictions, but the optimal method for representing patient data remains unclear due to inconsistent evaluation practices. We present the first systematic benchmark to compare EHR representation methods, including multivariate time-series, event streams, and textual event streams for LLMs. This benchmark standardises data curation and evaluation across two distinct clinical settings: the MIMIC-IV dataset for ICU tasks (mortality, phenotyping) and the EHRSHOT dataset for longitudinal care (30-day readmission, 1-year pancreatic cancer). For each paradigm, we evaluate appropriate modelling families--including Transformers, MLP, LSTMs and Retain for time-series, CLMBR and count-based models for event streams, 8-20B LLMs for textual streams--and analyse the impact of feature pruning based on data missingness. Our experiments reveal that event stream models consistently deliver the strongest performance. Pre-trained models like CLMBR are highly sample-efficient in few-shot settings, though simpler count-based models can be competitive given sufficient data. Furthermore, we find that feature selection strategies must be adapted to the clinical setting: pruning sparse features improves ICU predictions, while retaining them is critical for longitudinal tasks. Our results, enabled by a unified and reproducible pipeline, provide practical guidance for selecting EHR representations based on the clinical context and data regime.
☆ Agentic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning
Current knowledge-enhanced large language models (LLMs) rely on static, pre-constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic-KGR, a novel framework enabling co-evolution between LLMs and knowledge graphs (KGs) through multi-round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.
☆ Federated Data Analytics for Cancer Immunotherapy: A Privacy-Preserving Collaborative Platform for Patient Management
Connected health is a multidisciplinary approach focused on health management, prioritizing pa-tient needs in the creation of tools, services, and treatments. This paradigm ensures proactive and efficient care by facilitating the timely exchange of accurate patient information among all stake-holders in the care continuum. The rise of digital technologies and process innovations promises to enhance connected health by integrating various healthcare data sources. This integration aims to personalize care, predict health outcomes, and streamline patient management, though challeng-es remain, particularly in data architecture, application interoperability, and security. Data analytics can provide critical insights for informed decision-making and health co-creation, but solutions must prioritize end-users, including patients and healthcare professionals. This perspective was explored through an agile System Development Lifecycle in an EU-funded project aimed at developing an integrated AI-generated solution for managing cancer patients undergoing immunotherapy. This paper contributes with a collaborative digital framework integrating stakeholders across the care continuum, leveraging federated big data analytics and artificial intelligence for improved decision-making while ensuring privacy. Analytical capabilities, such as treatment recommendations and adverse event predictions, were validated using real-life data, achieving 70%-90% accuracy in a pilot study with the medical partners, demonstrating the framework's effectiveness.
comment: This manuscript is currently under review at * ACM Transactions on Computing for Healthcare (HEALTH)*
☆ Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting
Large language models (LLMs) often face a trade-off in post-training: improvements on specialized domains frequently come at the expense of general capabilities. Existing solutions attempt to mitigate this tension via regularization, selective parameter updates, or data-centric replay, but each imposes significant costs in computation, data access, or adaptability. Recent work has shown that training signals can be compressed to subsets of logits without severe accuracy loss, suggesting a path toward efficient adaptation. However, naive truncation destabilizes optimization and exacerbates forgetting. We introduce Logits Replay + MoClip, a two-stage framework that compresses supervision in the logit space and stabilizes optimization at the update level. In Stage 0, we record dynamic Top-K token subsets that cover a probability threshold, always including the gold label. In Stage 1, we replay these compact subsets to compute exact renormalized losses, avoiding full softmax computation and implicitly regularizing. To ensure stability, we design MoClip, an optimizer that caps gradient-momentum rotation and applies an arctan2-based rescaling of updates. Empirically, our method improves domain performance on Communication Technology (CT) and NL2SQL tasks while mitigating forgetting on general benchmarks (MMLU, BBH, GPQA, MATH), and reduces training cost by over 40%. Together, these contributions offer a scalable, architecture-agnostic path for domain adaptation of LLMs without sacrificing generalization.
☆ Score-Based Density Estimation from Pairwise Comparisons
We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner's score via score-matching. This allows estimating the target by `de-tempering' the estimated winner density's score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.
comment: 32 pages, 26 figures
☆ Training Feature Attribution for Vision Models
Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
☆ PAC Reasoning: Controlling the Performance Loss for Efficient Reasoning
Large reasoning models (LRMs) have achieved remarkable progress in complex problem-solving tasks. Despite this success, LRMs typically suffer from high computational costs during deployment, highlighting a need for efficient inference. A popular direction of efficiency improvement is to switch the LRM between thinking and nonthinking modes dynamically. However, such approaches often introduce additional reasoning errors and lack statistical guarantees for the performance loss, which are critical for high-stakes applications. In this work, we propose Probably Approximately Correct (PAC) reasoning that controls the performance loss under the user-specified performance loss tolerance. In particular, we construct an upper confidence bound on the performance loss, formulated as a monotone function of the uncertainty score, and subsequently determine a threshold for switching to the nonthinking model. Theoretically, using the threshold to switch between the thinking and nonthinking modes ensures bounded performance loss in a distribution-free manner. Our comprehensive experiments on reasoning benchmarks show that the proposed method can save computational budgets and control the user-specified performance loss.
☆ Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback
We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $ O (\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|)} $ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $ \mathcal{F} $ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT\mathcal{R}_T(\mathcal{O})} + \sqrt{ d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
☆ On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning
While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.
☆ A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans
Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
☆ MemLoss: Enhancing Adversarial Training with Recycling Adversarial Examples
In this paper, we propose a new approach called MemLoss to improve the adversarial training of machine learning models. MemLoss leverages previously generated adversarial examples, referred to as 'Memory Adversarial Examples,' to enhance model robustness and accuracy without compromising performance on clean data. By using these examples across training epochs, MemLoss provides a balanced improvement in both natural accuracy and adversarial robustness. Experimental results on multiple datasets, including CIFAR-10, demonstrate that our method achieves better accuracy compared to existing adversarial training methods while maintaining strong robustness against attacks.
comment: 24 pages
☆ AdaPM: a Partial Momentum Algorithm for LLM Training
In the training of large language models, momentum is widely used and often demonstrated to achieve significant acceleration. However, storing momentum typically presents memory challenges. In this paper, we propose AdaPM, an adaptive training strategy that leverages partial momentum to implement a memory-efficient optimizer. To this end, AdaPM utilizes a non-uniform momentum design: for most blocks, full momentum is not necessary to preserve the performance of the optimization. In the momentum design of AdaPM, to mitigate the bias and performance loss caused by partial momentum, we enhance the partial momentum by a bias correction technique. Empirically, we verify that our approach reduces memory by over $90\%$ in momentum while maintaining both efficiency and performance for pretraining various language models ranging from 60M to 1.5B, as well as for supervised fine-tuning and RLHF. AdaPM can further reduce memory by up to $95\%$ in optimizer states by combining the memory-efficient technique on the second-order statistic, saving over $30\%$ GPU hours for pretraining GPT-2 1.5B.
☆ When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
Learning from demonstrations enables experts to teach robots complex tasks using interfaces such as kinesthetic teaching, joystick control, and sim-to-real transfer. However, these interfaces often constrain the expert's ability to demonstrate optimal behavior due to indirect control, setup restrictions, and hardware safety. For example, a joystick can move a robotic arm only in a 2D plane, even though the robot operates in a higher-dimensional space. As a result, the demonstrations collected by constrained experts lead to suboptimal performance of the learned policies. This raises a key question: Can a robot learn a better policy than the one demonstrated by a constrained expert? We address this by allowing the agent to go beyond direct imitation of expert actions and explore shorter and more efficient trajectories. We use the demonstrations to infer a state-only reward signal that measures task progress, and self-label reward for unknown states using temporal interpolation. Our approach outperforms common imitation learning in both sample efficiency and task completion time. On a real WidowX robotic arm, it completes the task in 12 seconds, 10x faster than behavioral cloning, as shown in real-robot videos on https://sites.google.com/view/constrainedexpert .
☆ Neural Codecs as Biosignal Tokenizers
Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.
comment: 25 pages, 7 figures, 10 tables, currently under peer review
☆ FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.
comment: 5 pages, 5 figures
☆ Improving Anomaly Detection in Industrial Time Series: The Role of Segmentation and Heterogeneous Ensemble
Concerning machine learning, segmentation models can identify state changes within time series, facilitating the detection of transitions between normal and anomalous conditions. Specific techniques such as Change Point Detection (CPD), particularly algorithms like ChangeFinder, have been successfully applied to segment time series and improve anomaly detection by reducing temporal uncertainty, especially in multivariate environments. In this work, we explored how the integration of segmentation techniques, combined with a heterogeneous ensemble, can enhance anomaly detection in an industrial production context. The results show that applying segmentation as a pre-processing step before selecting heterogeneous ensemble algorithms provided a significant advantage in our case study, improving the AUC-ROC metric from 0.8599 (achieved with a PCA and LSTM ensemble) to 0.9760 (achieved with Random Forest and XGBoost). This improvement is imputable to the ability of segmentation to reduce temporal ambiguity and facilitate the learning process of supervised algorithms. In our future work, we intend to assess the benefit of introducing weighted features derived from the study of change points, combined with segmentation and the use of heterogeneous ensembles, to further optimize model performance in early anomaly detection.
comment: Conference paper. Under publication process at CODIT 2025
☆ MCMC: Bridging Rendering, Optimization and Generative AI SIGGRAPH
Generative artificial intelligence (AI) has made unprecedented advances in vision language models over the past two years. During the generative process, new samples (images) are generated from an unknown high-dimensional distribution. Markov Chain Monte Carlo (MCMC) methods are particularly effective in drawing samples from such complex, high-dimensional distributions. This makes MCMC methods an integral component for models like EBMs, ensuring accurate sample generation. Gradient-based optimization is at the core of modern generative models. The update step during the optimization forms a Markov chain where the new update depends only on the current state. This allows exploration of the parameter space in a memoryless manner, thus combining the benefits of gradient-based optimization and MCMC sampling. MCMC methods have shown an equally important role in physically based rendering where complex light paths are otherwise quite challenging to sample from simple importance sampling techniques. A lot of research is dedicated towards bringing physical realism to samples (images) generated from diffusion-based generative models in a data-driven manner, however, a unified framework connecting these techniques is still missing. In this course, we take the first steps toward understanding each of these components and exploring how MCMC could potentially serve as a bridge, linking these closely related areas of research. Our course aims to provide necessary theoretical and practical tools to guide students, researchers and practitioners towards the common goal of generative physically based rendering. All Jupyter notebooks with demonstrations associated to this tutorial can be found on the project webpage: https://sinbag.github.io/mcmc/
comment: SIGGRAPH Asia 2024 Courses. arXiv admin note: text overlap with arXiv:2208.11970 by other authors
☆ Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
Effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
comment: 13 pages, 1 figure
☆ MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
comment: 4 pages, 4 figures, 2 tables
☆ Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation
Developing a high-performing large language models (LLMs) for low-resource languages such as Urdu, present several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model, that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct for Urdu specific-tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performance and low-resource language LLMs can be developed efficiently and culturally aligned using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.
comment: Accepted to the EMNLP 2025 Workshop on Multilingual Representation Learning (MRL)
☆ Spatio-Temporal Graph Convolutional Networks for EV Charging Demand Forecasting Using Real-World Multi-Modal Data Integration
Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as electric vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces TW-GCN, a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States (U.S.). We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest EV infrastructure company in the U.S. to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying lag horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, population, and local demand variability shape model performance. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning, supporting both sustainable mobility transitions and resilient grid management.
☆ Cost-Efficient Long Code Translation using LLMs while Leveraging Identifier Replacements
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
☆ MAKO: Meta-Adaptive Koopman Operators for Learning-based Model Predictive Control of Parametrically Uncertain Nonlinear Systems
In this work, we propose a meta-learning-based Koopman modeling and predictive control approach for nonlinear systems with parametric uncertainties. An adaptive deep meta-learning-based modeling approach, called Meta Adaptive Koopman Operator (MAKO), is proposed. Without knowledge of the parametric uncertainty, the proposed MAKO approach can learn a meta-model from a multi-modal dataset and efficiently adapt to new systems with previously unseen parameter settings by using online data. Based on the learned meta Koopman model, a predictive control scheme is developed, and the stability of the closed-loop system is ensured even in the presence of previously unseen parameter settings. Through extensive simulations, our proposed approach demonstrates superior performance in both modeling accuracy and control efficacy as compared to competitive baselines.
☆ Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach
Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9\% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.
☆ Auto-scaling Continuous Memory for GUI Agent
We study how to endow GUI agents with scalable memory that help generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about \$4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2\% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
☆ Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels
Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.
☆ Convergence of optimizers implies eigenvalues filtering at equilibrium
Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
☆ The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques-gradient descent, reinforcement learning, random search, and human-guided exploration-we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
☆ The Environmental Impacts of Machine Learning Training Keep Rising Evidencing Rebound Effect
Recent Machine Learning (ML) approaches have shown increased performance on benchmarks but at the cost of escalating computational demands. Hardware, algorithmic and carbon optimizations have been proposed to curb energy consumption and environmental impacts. Can these strategies lead to sustainable ML model training? Here, we estimate the environmental impacts associated with training notable AI systems over the last decade, including Large Language Models, with a focus on the life cycle of graphics cards. Our analysis reveals two critical trends: First, the impacts of graphics cards production have increased steadily over this period; Second, energy consumption and environmental impacts associated with training ML models have increased exponentially, even when considering reduction strategies such as location shifting to places with less carbon intensive electricity mixes. Optimization strategies do not mitigate the impacts induced by model training, evidencing rebound effect. We show that the impacts of hardware must be considered over the entire life cycle rather than the sole use phase in order to avoid impact shifting. Our study demonstrates that increasing efficiency alone cannot ensure sustainability in ML. Mitigating the environmental impact of AI also requires reducing AI activities and questioning the scale and frequency of resource-intensive training.
comment: arXiv admin note: text overlap with arXiv:2412.17376
☆ RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
☆ MagicDock: Toward Docking-oriented De Novo Ligand Design via Gradient Inversion
De novo ligand design is a fundamental task that seeks to generate protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity entirely from scratch. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by the \textbf{Pseudo De Novo}, \textbf{Limited Docking Modeling}, and \textbf{Inflexible Ligand Type}. To address these issues, we propose MagicDock, a forward-looking framework grounded in the progressive pipeline and differentiable surface modeling. (1) We adopt a well-designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the docking process, leveraging learnable 3D point-cloud representations to precisely capture binding details, thereby ensuring that the generated ligands preserve docking validity through direct and interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide rigorous theoretical guarantees for each component of MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 27.1\% and 11.7\% over SOTA baselines specialized for protein or molecule ligand design, respectively.
comment: 52 pages, 14 figures, 12 tables
☆ Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference
Most neural network scheduling research focuses on optimizing static, end-to-end models of fixed width, overlooking dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions. We present Slim Scheduler, a hybrid scheduling framework that integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. Each server runs a local greedy scheduler that batches compatible requests and manages instance scaling based on VRAM and utilization constraints, while the PPO router learns global routing policies for device selection, width ratio, and batch configuration. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput. Compared to a purely randomized task distribution baseline, Slim Scheduler can achieve various accuracy and latency trade-offs such as: A 96.45% reduction in mean latency and a 97.31% reduction in energy usage dropping accuracy to the slimmest model available (70.3%). It can then accomplish an overall reduction in average latency plus energy consumption with an increase in accuracy at the cost of higher standard deviations of said latency and energy, effecting overall task throughput.
☆ Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing 'no-op' attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token's contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
☆ LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data
Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data could often be low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.
comment: Accepted by 18th ACM Workshop on Artificial Intelligence and Security (AISec'25)
☆ SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions
Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often resulting in suboptimal compression rates to preserve acceptable performance drops. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to inducing sparsity and model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide the consistent result of our proposed variational approach to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.
☆ PlatformX: An End-to-End Transferable Platform for Energy-Efficient Neural Architecture Search
Hardware-Aware Neural Architecture Search (HW-NAS) has emerged as a powerful tool for designing efficient deep neural networks (DNNs) tailored to edge devices. However, existing methods remain largely impractical for real-world deployment due to their high time cost, extensive manual profiling, and poor scalability across diverse hardware platforms with complex, device-specific energy behavior. In this paper, we present PlatformX, a fully automated and transferable HW-NAS framework designed to overcome these limitations. PlatformX integrates four key components: (i) an energy-driven search space that expands conventional NAS design by incorporating energy-critical configurations, enabling exploration of high-efficiency architectures; (ii) a transferable kernel-level energy predictor across devices and incrementally refined with minimal on-device samples; (iii) a Pareto-based multi-objective search algorithm that balances energy and accuracy to identify optimal trade-offs; and (iv) a high-resolution runtime energy profiling system that automates on-device power measurement using external monitors without human intervention. We evaluate PlatformX across multiple mobile platforms, showing that it significantly reduces search overhead while preserving accuracy and energy fidelity. It identifies models with up to 0.94 accuracy or as little as 0.16 mJ per inference, both outperforming MobileNet-V2 in accuracy and efficiency. Code and tutorials are available at github.com/amai-gsu/PlatformX.
☆ Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search
While researchers have made significant progress in enabling large language models (LLMs) to perform multi-step planning, LLMs struggle to ensure that those plans align with high-level user intent and satisfy symbolic constraints, especially in complex, multi-step domains. Existing reasoning approaches such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and verifier-augmented methods, expand the search space but often yield infeasible actions or hallucinated steps. To overcome these limitations, we propose Constraints-of-Thought (Const-o-T), a framework that provides a structured prior that enables Monte Carlo Tree Search (MCTS) focus search on semantically meaningful paths. Each reasoning step is represented as an (intent, constraint) pair, which serves both to compress the search space and enforce validity. Unlike prior methods that merely generate reasoning traces or validate outputs post hoc, Const-o-T uses (intent, constraint)pairs to actively focus the search toward feasible and meaningful plans. We integrate Const-o-T into MCTS using a structured representation of intent-constraint pairs constraints prune infeasible branches and guide exploration toward semantically valid actions, improving planning efficiency and verifiable decision-making. We demonstrate across three domains Risk game, CAD code generation, and arithmetic reasoning that our approach outperforms baselines, yielding higher accuracy and stronger structural alignment. Our contribution is to demonstrate that Const-of-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning with LLMs.
☆ FedL2T: Personalized Federated Learning with Two-Teacher Distillation for Seizure Prediction
The training of deep learning models in seizure prediction requires large amounts of Electroencephalogram (EEG) data. However, acquiring sufficient labeled EEG data is difficult due to annotation costs and privacy constraints. Federated Learning (FL) enables privacy-preserving collaborative training by sharing model updates instead of raw data. However, due to the inherent inter-patient variability in real-world scenarios, existing FL-based seizure prediction methods struggle to achieve robust performance under heterogeneous client settings. To address this challenge, we propose FedL2T, a personalized federated learning framework that leverages a novel two-teacher knowledge distillation strategy to generate superior personalized models for each client. Specifically, each client simultaneously learns from a globally aggregated model and a dynamically assigned peer model, promoting more direct and enriched knowledge exchange. To ensure reliable knowledge transfer, FedL2T employs an adaptive multi-level distillation strategy that aligns both prediction outputs and intermediate feature representations based on task confidence. In addition, a proximal regularization term is introduced to constrain personalized model updates, thereby enhancing training stability. Extensive experiments on two EEG datasets demonstrate that FedL2T consistently outperforms state-of-the-art FL methods, particularly under low-label conditions. Moreover, FedL2T exhibits rapid and stable convergence toward optimal performance, thereby reducing the number of communication rounds and associated overhead. These results underscore the potential of FedL2T as a reliable and personalized solution for seizure prediction in privacy-sensitive healthcare scenarios.
☆ Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation
AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
☆ Diagnosing and Mitigating System Bias in Self-Rewarding RL
Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
☆ Learning Regularizers: Learning Optimizers that can Regularize
Learned Optimizers (LOs), a type of Meta-learning, have gained traction due to their ability to be parameterized and trained for efficient optimization. Traditional gradient-based methods incorporate explicit regularization techniques such as Sharpness-Aware Minimization (SAM), Gradient-norm Aware Minimization (GAM), and Gap-guided Sharpness-Aware Minimization (GSAM) to enhance generalization and convergence. In this work, we explore a fundamental question: \textbf{Can regularizers be learned?} We empirically demonstrate that LOs can be trained to learn and internalize the effects of traditional regularization techniques without explicitly applying them to the objective function. We validate this through extensive experiments on standard benchmarks (including MNIST, FMNIST, CIFAR and Neural Networks such as MLP, MLP-Relu and CNN), comparing LOs trained with and without access to explicit regularizers. Regularized LOs consistently outperform their unregularized counterparts in terms of test accuracy and generalization. Furthermore, we show that LOs retain and transfer these regularization effects to new optimization tasks by inherently seeking minima similar to those targeted by these regularizers. Our results suggest that LOs can inherently learn regularization properties, \textit{challenging the conventional necessity of explicit optimizee loss regularization.
☆ HiBBO: HiPPO-based Space Consistency for High-dimensional Bayesian Optimisation
Bayesian Optimisation (BO) is a powerful tool for optimising expensive blackbox functions but its effectiveness diminishes in highdimensional spaces due to sparse data and poor surrogate model scalability While Variational Autoencoder (VAE) based approaches address this by learning low-dimensional latent representations the reconstructionbased objective function often brings the functional distribution mismatch between the latent space and original space leading to suboptimal optimisation performance In this paper we first analyse the reason why reconstructiononly loss may lead to distribution mismatch and then propose HiBBO a novel BO framework that introduces the space consistency into the latent space construction in VAE using HiPPO - a method for longterm sequence modelling - to reduce the functional distribution mismatch between the latent space and original space Experiments on highdimensional benchmark tasks demonstrate that HiBBO outperforms existing VAEBO methods in convergence speed and solution quality Our work bridges the gap between high-dimensional sequence representation learning and efficient Bayesian Optimisation enabling broader applications in neural architecture search materials science and beyond.
☆ Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation
Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI); however, the costs associated with data annotation and model training remain significant. A fundamental objective of AI research is to achieve robust generalization with limited-resource data. This survey employs agnostic active sampling theory within the Probably Approximately Correct (PAC) framework to analyze the generalization error and label complexity associated with learning from low-resource data in both model-agnostic supervised and unsupervised settings. Based on this analysis, we investigate a suite of optimization strategies tailored for low-resource data learning, including gradient-informed optimization, meta-iteration optimization, geometry-aware optimization, and LLMs-powered optimization. Furthermore, we provide a comprehensive overview of multiple learning paradigms that can benefit from low-resource data, including domain transfer, reinforcement feedback, and hierarchical structure modeling. Finally, we conclude our analysis and investigation by summarizing the key findings and highlighting their implications for learning with low-resource data.
comment: Accepted by ACM Computing Surveys
☆ Denoised Diffusion for Object-Focused Image Augmentation
Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
☆ When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement Approach
Text-attributed graphs (TAGs) have emerged as a powerful representation that combines structural connections with fine-grained semantics, supporting a wide range of data-centric applications. However, the performance of graph neural networks (GNNs) on TAGs is highly sensitive to input quality. Our empirical study shows that both traditional GNNs and LLM-enhanced GNNs suffer significant degradation across nine representative scenarios of sparsity, noise, and imbalance, highlighting graph quality as a critical bottleneck. Existing approaches mainly focus on improving model architectures, while neglecting systematic optimization of TAG data itself, leading to limited effectiveness in practice. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework that treats graph quality control as a first-class, data-centric problem. LAGA integrates four collaborative agents-detection, planning, action, and evaluation-into an automated closed loop. At its core, the action agent employs a dual-encoder and tri-objective design to capture complementary information across modalities and perform holistic graph quality enhancement. Experiments across nine scenarios show that LAGA improves graph quality and achieves state-of-the-art performance across various tasks and backbones, validating data-centric quality optimization as key to reliable TAGs and robust graph learning.
comment: 12 pages, 7figures
☆ Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection
Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a uniffed module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid conffguration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end ffnetuning. With our Gauss-Seidel projection module in place, two denoising steps are sufffcient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our 2-step model achieves the same structural accuracy as state-of-the-art 200-step diffusion baselines, delivering approximately 10 times faster wall-clock speed while guaranteeing physical validity.
☆ Variability Aware Recursive Neural Network (VARNN): A Residual-Memory Model for Capturing Temporal Deviation in Sequence Regression Modeling
Real-world time series data exhibit non-stationary behavior, regime shifts, and temporally varying noise (heteroscedastic) that degrade the robustness of standard regression models. We introduce the Variability-Aware Recursive Neural Network (VARNN), a novel residual-aware architecture for supervised time-series regression that learns an explicit error memory from recent prediction residuals and uses it to recalibrate subsequent predictions. VARNN augments a feed-forward predictor with a learned error-memory state that is updated from residuals over a short context steps as a signal of variability and drift, and then conditions the final prediction at the current time step. Across diverse dataset domains, appliance energy, healthcare, and environmental monitoring, experimental results demonstrate VARNN achieves superior performance and attains lower test MSE with minimal computational overhead over static, dynamic, and recurrent baselines. Our findings show that the VARNN model offers robust predictions under a drift and volatility environment, highlighting its potential as a promising framework for time-series learning.
☆ Bi-level Meta-Policy Control for Dynamic Uncertainty Calibration in Evidential Deep Learning
Traditional Evidence Deep Learning (EDL) methods rely on static hyperparameter for uncertainty calibration, limiting their adaptability in dynamic data distributions, which results in poor calibration and generalization in high-risk decision-making tasks. To address this limitation, we propose the Meta-Policy Controller (MPC), a dynamic meta-learning framework that adjusts the KL divergence coefficient and Dirichlet prior strengths for optimal uncertainty modeling. Specifically, MPC employs a bi-level optimization approach: in the inner loop, model parameters are updated through a dynamically configured loss function that adapts to the current training state; in the outer loop, a policy network optimizes the KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing prediction accuracy and uncertainty quality. Unlike previous methods with fixed priors, our learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics. Extensive experimental results show that MPC significantly enhances the reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.
☆ MATT-CTR: Unleashing a Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths
Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.
comment: 10 pages, 4 figures, 2 tables
☆ RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation NeurIPS 2025
Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93\% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7\% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.
comment: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
☆ Mirror Flow Matching with Heavy-Tailed Priors for Generative Modeling on Convex Domains
We study generative modeling on convex domains using flow matching and mirror maps, and identify two fundamental challenges. First, standard log-barrier mirror maps induce heavy-tailed dual distributions, leading to ill-posed dynamics. Second, coupling with Gaussian priors performs poorly when matching heavy-tailed targets. To address these issues, we propose Mirror Flow Matching based on a \emph{regularized mirror map} that controls dual tail behavior and guarantees finite moments, together with coupling to a Student-$t$ prior that aligns with heavy-tailed targets and stabilizes training. We provide theoretical guarantees, including spatial Lipschitzness and temporal regularity of the velocity field, Wasserstein convergence rates for flow matching with Student-$t$ priors and primal-space guarantees for constrained generation, under $\varepsilon$-accurate learned velocity fields. Empirically, our method outperforms baselines in synthetic convex-domain simulations and achieves competitive sample quality on real-world constrained generative tasks.
☆ AB-PINNs: Adaptive-Basis Physics-Informed Neural Networks for Residual-Driven Domain Decomposition
We introduce adaptive-basis physics-informed neural networks (AB-PINNs), a novel approach to domain decomposition for training PINNs in which existing subdomains dynamically adapt to the intrinsic features of the unknown solution. Drawing inspiration from classical mesh refinement techniques, we also modify the domain decomposition on-the-fly throughout training by introducing new subdomains in regions of high residual loss, thereby providing additional expressive power where the solution of the differential equation is challenging to represent. Our flexible approach to domain decomposition is well-suited for multiscale problems, as different subdomains can learn to capture different scales of the underlying solution. Moreover, the ability to introduce new subdomains during training helps prevent convergence to unwanted local minima and can reduce the need for extensive hyperparameter tuning compared to static domain decomposition approaches. Throughout, we present comprehensive numerical results which demonstrate the effectiveness of AB-PINNs at solving a variety of complex multiscale partial differential equations.
☆ Simple and Robust Forecasting of Spatiotemporally Correlated Small Earth Data with A Tabular Foundation Model
Small Earth data are geoscience observations with limited short-term monitoring variability, providing sparse but meaningful measurements, typically exhibiting spatiotemporal correlations. Spatiotemporal forecasting on such data is crucial for understanding geoscientific processes despite their small scale. However, conventional deep learning models for spatiotemporal forecasting requires task-specific training for different scenarios. Foundation models do not need task-specific training, but they often exhibit forecasting bias toward the global mean of the pretraining distribution. Here we propose a simple and robust approach for spatiotemporally correlated small Earth data forecasting. The essential idea is to characterize and quantify spatiotemporal patterns of small Earth data and then utilize tabular foundation models for accurate forecasting across different scenarios. Comparative results across three typical scenarios demonstrate that our forecasting approach achieves superior accuracy compared to the graph deep learning model (T-GCN) and tabular foundation model (TabPFN) in the majority of instances, exhibiting stronger robustness.
☆ PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
comment: 23 pages
☆ A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization
The representer theorem is a cornerstone of kernel methods, which aim to estimate latent functions in reproducing kernel Hilbert spaces (RKHSs) in a nonparametric manner. Its significance lies in converting inherently infinite-dimensional optimization problems into finite-dimensional ones over dual coefficients, thereby enabling practical and computationally tractable algorithms. In this paper, we address the problem of estimating the latent triggering kernels--functions that encode the interaction structure between events--for linear multivariate Hawkes processes based on observed event sequences within an RKHS framework. We show that, under the principle of penalized least squares minimization, a novel form of representer theorem emerges: a family of transformed kernels can be defined via a system of simultaneous integral equations, and the optimal estimator of each triggering kernel is expressed as a linear combination of these transformed kernels evaluated at the data points. Remarkably, the dual coefficients are all analytically fixed to unity, obviating the need to solve a costly optimization problem to obtain the dual coefficients. This leads to a highly efficient estimator capable of handling large-scale data more effectively than conventional nonparametric approaches. Empirical evaluations on synthetic datasets reveal that the proposed method attains competitive predictive accuracy while substantially improving computational efficiency over existing state-of-the-art kernel method-based estimators.
comment: Under review
☆ Velocity and Density-Aware RRI Analysis and Optimization for AoI Minimization in IoV SPS
Addressing the problem of Age of Information (AoI) deterioration caused by packet collisions and vehicle speed-related channel uncertainties in Semi-Persistent Scheduling (SPS) for the Internet of Vehicles (IoV), this letter proposes an optimization approach based on Large Language Models (LLM) and Deep Deterministic Policy Gradient (DDPG). First, an AoI calculation model influenced by vehicle speed, vehicle density, and Resource Reservation Interval (RRI) is established, followed by the design of a dual-path optimization scheme. The DDPG is guided by the state space and reward function, while the LLM leverages contextual learning to generate optimal parameter configurations. Experimental results demonstrate that LLM can significantly reduce AoI after accumulating a small number of exemplars without requiring model training, whereas the DDPG method achieves more stable performance after training.
comment: This paper has been submitted to IEEE Communications Letters
☆ A Frequency-Domain Analysis of the Multi-Armed Bandit Problem: A New Perspective on the Exploration-Exploitation Trade-off
The stochastic multi-armed bandit (MAB) problem is one of the most fundamental models in sequential decision-making, with the core challenge being the trade-off between exploration and exploitation. Although algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling, along with their regret theories, are well-established, existing analyses primarily operate from a time-domain and cumulative regret perspective, struggling to characterize the dynamic nature of the learning process. This paper proposes a novel frequency-domain analysis framework, reformulating the bandit process as a signal processing problem. Within this framework, the reward estimate of each arm is viewed as a spectral component, with its uncertainty corresponding to the component's frequency, and the bandit algorithm is interpreted as an adaptive filter. We construct a formal Frequency-Domain Bandit Model and prove the main theorem: the confidence bound term in the UCB algorithm is equivalent in the frequency domain to a time-varying gain applied to uncertain spectral components, a gain inversely proportional to the square root of the visit count. Based on this, we further derive finite-time dynamic bounds concerning the exploration rate decay. This theory not only provides a novel and intuitive physical interpretation for classical algorithms but also lays a rigorous theoretical foundation for designing next-generation algorithms with adaptive parameter adjustment.
comment: 6 pages
☆ Gradient-Guided Furthest Point Sampling for Robust Training Set Selection
Smart training set selections procedures enable the reduction of data needs and improves predictive robustness in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of configurational spaces of molecules. Numerical evidence is presented for a toy-system (Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Compared to FPS and uniform sampling, our numerical results indicate superior data efficiency and robustness when using GGFPS. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to two fold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinksi-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
comment: 18 pages, 18 figures, journal article
☆ Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning
While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
comment: 12 pages, 5 figures
☆ An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.
☆ GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare
Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
comment: 31 pages, 6 figures
♻ ☆ Detecting and Filtering Unsafe Training Data via Data Attribution with Denoised Representation
Large language models (LLMs) are highly sensitive to even small amounts of unsafe training data, making effective detection and filtering essential for trustworthy model development. Current state-of-the-art (SOTA) detection approaches primarily rely on moderation classifiers, which require significant computation overhead for training and are limited to predefined taxonomies. In this work, we explore data attribution approaches that measure the similarity between individual training samples and a small set of unsafe target examples, based on data representations such as hidden states or gradients. We identify a key limitation in existing methods: unsafe target texts contain both critical tokens that make them unsafe and neutral tokens (e.g., stop words or benign facts) that are necessary to form fluent language, and the latter of which makes the overall representations ``noisy'' for the purpose of detecting unsafe training data. To address this challenge, we propose Denoised Representation Attribution (DRA), a novel representation-based data attribution approach that denoises training and target representations for unsafe data detection. Across tasks of filtering jailbreaks and detecting gender bias, the proposed approach leads to significant improvement for data attribution methods, outperforming SOTA methods that are mostly based on moderation classifiers.
comment: 14 pages
♻ ☆ CarbonX: An Open-Source Tool for Computational Decarbonization Using Time Series Foundation Models
Computational decarbonization aims to reduce carbon emissions in computing and societal systems such as data centers, transportation, and built environments. This requires accurate, fine-grained carbon intensity forecasts, yet existing tools have several key limitations: (i) they require grid-specific electricity mix data, restricting use where such information is unavailable; (ii) they depend on separate grid-specific models that make it challenging to provide global coverage; and (iii) they provide forecasts without uncertainty estimates, limiting reliability for downstream carbon-aware applications. In this paper, we present CarbonX, an open-source tool that leverages Time Series Foundation Models (TSFMs) for a range of decarbonization tasks. CarbonX utilizes the versatility of TSFMs to provide strong performance across multiple tasks, such as carbon intensity forecasting and imputation, and across diverse grids. Using only historical carbon intensity data and a single general model, our tool achieves a zero-shot forecasting Mean Absolute Percentage Error (MAPE) of 15.82% across 214 grids worldwide. Across 13 benchmark grids, CarbonX performance is comparable with the current state-of-the-art, with an average MAPE of 9.59% and tail forecasting MAPE of 16.54%, while also providing prediction intervals with 95% coverage. CarbonX can provide forecasts for up to 21 days with minimal accuracy degradation. Further, when fully fine-tuned, CarbonX outperforms the statistical baselines by 1.2--3.9X on the imputation task. Overall, these results demonstrate that CarbonX can be used easily on any grid with limited data and still deliver strong performance, making it a practical tool for global-scale decarbonization.
comment: Update: Corrected PDF rendering error on page 10 (caption of Figure 5 was previously overlapping with paper text)
♻ ☆ Untangling Component Imbalance in Hybrid Linear Attention Conversion Methods
Transformers' quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on SWA. Component-level diagnostics reveal this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, combining attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.
♻ ☆ Orthogonal Representation Learning for Estimating Causal Quantities
End-to-end representation learning has become a powerful tool for estimating causal quantities from high-dimensional observational data, but its efficiency remained unclear. Here, we face a central tension: End-to-end representation learning methods often work well in practice but lack asymptotic optimality in the form of the quasi-oracle efficiency. In contrast, two-stage Neyman-orthogonal learners provide such a theoretical optimality property but do not explicitly benefit from the strengths of representation learning. In this work, we step back and ask two research questions: (1) When do representations strengthen existing Neyman-orthogonal learners? and (2) Can a balancing constraint - commonly proposed technique in the representation learning literature - provide improvements to Neyman-orthogonality? We address these two questions through our theoretical and empirical analysis, where we introduce a unifying framework that connects representation learning with Neyman-orthogonal learners (namely, OR-learners). In particular, we show that, under the low-dimensional manifold hypothesis, the OR-learners can strictly improve the estimation error of the standard Neyman-orthogonal learners. At the same time, we find that the balancing constraint requires an additional inductive bias and cannot generally compensate for the lack of Neyman-orthogonality of the end-to-end approaches. Building on these insights, we offer guidelines for how users can effectively combine representation learning with the classical Neyman-orthogonal learners to achieve both practical performance and theoretical guarantees.
♻ ☆ On the Interpolation Effect of Score Smoothing in Diffusion Models
Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we study the hypothesis that such creativity arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional subspace, we show theoretically how regularized two-layer ReLU neural networks tend to learn approximately a smoothed version of the empirical score function, and further probe the interplay between score smoothing and the denoising dynamics with analytical solutions and numerical experiments. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate the training data along their subspace while avoiding full memorization. Moreover, we present experimental evidence that learning score functions with neural networks indeed induces a score smoothing effect, including in simple nonlinear settings and without explicit regularization.
♻ ☆ Convergence Rate for the Last Iterate of Stochastic Gradient Descent Schemes
We study the convergence rate for the last iterate of stochastic gradient descent (SGD) and stochastic heavy ball (SHB) in the parametric setting when the objective function $F$ is globally convex or non-convex whose gradient is $\gamma$-H\"{o}lder. Using only discrete Gronwall's inequality without Robbins-Siegmund theorem, we recover results for both SGD and SHB: $\min_{s\leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives and $F(w_{\tau \wedge t}) - F_* = o(t^{2\gamma/(1+\gamma) \cdot \max(p-1,-2p+1)-\eps})$ for $\beta \in (0, 1)$, $\tau := \inf \{ t > 0 : F(w_t) = F_*\}$, and $\min_{s \leq t} F(w_s) - F_* = o(t^{p-1})$ for convex objectives $F$ whose minimum is $F_*$. In addition, we proved that SHB with constant momentum parameter $\beta \in (0, 1)$ attains a convergence rate of $F(w_t) - F_* = O(t^{\max(p-1,-2p+1)} \log^2 \frac{t}{\delta})$ with probability at least $1-\delta$ when $F$ is convex and $\gamma = 1$ and step size $\alpha_t = \Theta(t^{-p})$ with $p \in (\frac{1}{2}, 1)$.
♻ ☆ Continual Adapter Tuning with Semantic Shift Compensation for Class-Incremental Learning CVPR 2024
Class-incremental learning (CIL) aims to enable models to continuously learn new classes while overcoming catastrophic forgetting. The introduction of pre-trained models has brought new tuning paradigms to CIL. In this paper, we revisit different parameter-efficient tuning (PET) methods within the context of continual learning. We observe that adapter tuning demonstrates superiority over prompt-based methods, even without parameter expansion in each learning session. Motivated by this, we propose incrementally tuning the shared adapter without imposing parameter update constraints, enhancing the learning capacity of the backbone. Additionally, we employ feature sampling from stored prototypes to retrain a unified classifier, further improving its performance. We estimate the semantic shift of old prototypes without access to past samples and update stored prototypes session by session. Our proposed method eliminates model expansion and avoids retaining any image samples. It surpasses previous pre-trained model-based CIL methods and demonstrates remarkable continual learning capabilities. Experimental results on five CIL benchmarks validate the effectiveness of our approach, achieving state-of-the-art (SOTA) performance.
comment: Journal extension of SSIAT (CVPR 2024)
♻ ☆ On Task Vectors and Gradients
Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
comment: 10 pages of main paper, 5 figures
♻ ☆ A Digital Twin for Diesel Engines: Operator-infused Physics-Informed Neural Networks with Transfer Learning for Engine Health Monitoring
Improving diesel engine efficiency, reducing emissions, and enabling robust health monitoring have been critical research topics in engine modelling. While recent advancements in the use of neural networks for system monitoring have shown promising results, such methods often focus on component-level analysis, lack generalizability, and physical interpretability. In this study, we propose a novel hybrid framework that combines physics-informed neural networks (PINNs) with deep operator networks (DeepONet) to enable accurate and computationally efficient parameter identification in mean-value diesel engine models. Our method leverages physics-based system knowledge in combination with data-driven training of neural networks to enhance model applicability. Incorporating offline-trained DeepONets to predict actuator dynamics significantly lowers the online computation cost when compared to the existing PINN framework. To address the re-training burden typical of PINNs under varying input conditions, we propose two transfer learning (TL) strategies: (i) a multi-stage TL scheme offering better runtime efficiency than full online training of the PINN model and (ii) a few-shot TL scheme that freezes a shared multi-head network body and computes physics-based derivatives required for model training outside the training loop. The second strategy offers a computationally inexpensive and physics-based approach for predicting engine dynamics and parameter identification, offering computational efficiency over the existing PINN framework. Compared to existing health monitoring methods, our framework combines the interpretability of physics-based models with the flexibility of deep learning, offering substantial gains in generalization, accuracy, and deployment efficiency for diesel engine diagnostics.
♻ ☆ COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout, requiring custom inference code, while depth pruning can cause abrupt accuracy drops. Also, while many pruning approaches are effective against LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab. vs. FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.
♻ ☆ How a Bilingual LM Becomes Bilingual: Tracing Internal Representations with Sparse Autoencoders
This study explores how bilingual language models develop complex internal representations. We employ sparse autoencoders to analyze internal representations of bilingual language models with a focus on the effects of training steps, layers, and model sizes. Our analysis shows that language models first learn languages separately, and then gradually form bilingual alignments, particularly in the mid layers. We also found that this bilingual tendency is stronger in larger models. Building on these findings, we demonstrate the critical role of bilingual representations in model performance by employing a novel method that integrates decomposed representations from a fully trained model into a mid-training model. Our results provide insights into how language models acquire bilingual capabilities.
comment: 13 pages, 17 figures, accepted to EMNLP 2025 findings
♻ ☆ ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation
We introduce ClustRecNet - a novel deep learning (DL)-based recommendation framework for determining the most suitable clustering algorithms for a given dataset, addressing the long-standing challenge of clustering algorithm selection in unsupervised learning. To enable supervised learning in this context, we construct a comprehensive data repository comprising 34,000 synthetic datasets with diverse structural properties. Each of them was processed using 10 popular clustering algorithms. The resulting clusterings were assessed via the Adjusted Rand Index (ARI) to establish ground truth labels, used for training and evaluation of our DL model. The proposed network architecture integrates convolutional, residual, and attention mechanisms to capture both local and global structural patterns from the input data. This design supports end-to-end training to learn compact representations of datasets and enables direct recommendation of the most suitable clustering algorithm, reducing reliance on handcrafted meta-features and traditional Cluster Validity Indices (CVIs). Comprehensive experiments across synthetic and real-world benchmarks demonstrate that our DL model consistently outperforms conventional CVIs (e.g. Silhouette, Calinski-Harabasz, Davies-Bouldin, and Dunn) as well as state-of-the-art AutoML clustering recommendation approaches (e.g. ML2DAC, AutoCluster, and AutoML4Clust). Notably, the proposed model achieves a 0.497 ARI improvement over the Calinski-Harabasz index on synthetic data and a 15.3% ARI gain over the best-performing AutoML approach on real-world data.
♻ ☆ Language Model Guided Reinforcement Learning in Quantitative Trading
Algorithmic trading requires short-term tactical decisions consistent with long-term financial objectives. Reinforcement Learning (RL) has been applied to such problems, but adoption is limited by myopic behaviour and opaque policies. Large Language Models (LLMs) offer complementary strategic reasoning and multi-modal signal interpretation when guided by well-structured prompts. This paper proposes a hybrid framework in which LLMs generate high-level trading strategies to guide RL agents. We evaluate (i) the economic rationale of LLM-generated strategies through expert review, and (ii) the performance of LLM-guided agents against unguided RL baselines using Sharpe Ratio (SR) and Maximum Drawdown (MDD). Empirical results indicate that LLM guidance improves both return and risk metrics relative to standard RL.
comment: 12 pages (4 pages appendix and references) and 6 figures. Accepted for presentation at FLLM 2025, Vienna
♻ ☆ TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer
Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference-time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute's robust performance compared to $11$ established imputation methods.
♻ ☆ Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions
In this paper we will give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension that is based off of the Natarajan Dimension and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.
comment: 9 pages
♻ ☆ CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models
Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. Our CausalVLBench encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms in improving the visual causal reasoning abilities of LVLMs.
comment: Accepted to the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025 Main)
♻ ☆ FREE: The Foundational Semantic Recognition for Modeling Environmental Ecosystems
Modeling environmental ecosystems is critical for the sustainability of our planet, but is extremely challenging due to the complex underlying processes driven by interactions amongst a large number of physical variables. As many variables are difficult to measure at large scales, existing works often utilize a combination of observable features and locally available measurements or modeled values as input to build models for a specific study region and time period. This raises a fundamental question in advancing the modeling of environmental ecosystems: how to build a general framework for modeling the complex relationships among diverse environmental variables over space and time? In this paper, we introduce a framework, FREE, that enables the use of varying features and available information to train a universal model. The core idea is to map available environmental data into a text space and then convert the traditional predictive modeling task in environmental science to a semantic recognition problem. Our evaluation on two societally important real-world applications, stream water temperature prediction and crop yield prediction, demonstrates the superiority of FREE over multiple baselines, even in data-sparse scenarios.
♻ ☆ Filtering out mislabeled training instances using black-box optimization and quantum annealing
This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method's ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave's clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij's simulated quantum annealing sampler or Neal's simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.
♻ ☆ Neural Beam Field for Spatial Beam RSRP Prediction
Accurately predicting beam-level reference signal received power (RSRP) is essential for beam management in dense multi-user wireless networks, yet challenging due to high measurement overhead and fast channel variations. This paper proposes Neural Beam Field (NBF), a hybrid neural-physical framework for efficient and interpretable spatial beam RSRP prediction. Central to our approach is the introduction of the Multi-path Conditional Power Profile (MCPP), a learnable physical intermediary representing the site-specific propagation environment. This approach decouples the environment from specific antenna/beam configurations, which helps the model learn site-specific multipath features and enhances its generalization capability. We adopt a decoupled ``blackbox-whitebox" design: a Transformer-based deep neural network (DNN) learns the MCPP from sparse user measurements and positions, while a physics-inspired module analytically infers beam RSRP statistics. To improve convergence and adaptivity, we further introduce a Pretrain-and-Calibrate (PaC) strategy that leverages ray-tracing priors for physics-grounded pretraining and then RSRP data for on-site calibration. Extensive simulation results demonstrate that NBF significantly outperforms conventional table-based channel knowledge maps (CKMs) and pure blackbox DNNs in prediction accuracy, training efficiency, and generalization, while maintaining a compact model size. The proposed framework offers a scalable and physically grounded solution for intelligent beam management in next-generation dense wireless networks.
comment: Keywords: Neural Beam Field, Multipath Conditional Power Profile, Channel Knowledge Map, Beam-level RSRP, Transformer. Revised technical presentation and added more benchmark comparisons
♻ ☆ Networked Information Aggregation via Machine Learning
We study a distributed learning problem in which learning agents are embedded in a directed acyclic graph (DAG). There is a fixed and arbitrary distribution over feature/label pairs, and each agent or vertex in the graph is able to directly observe only a subset of the features -- potentially a different subset for every agent. The agents learn sequentially in some order consistent with a topological sort of the DAG, committing to a model mapping observations to predictions of the real-valued label. Each agent observes the predictions of their parents in the DAG, and trains their model using both the features of the instance that they directly observe, and the predictions of their parents as additional features. We ask when this process is sufficient to achieve \emph{information aggregation}, in the sense that some agent in the DAG is able to learn a model whose error is competitive with the best model that could have been learned (in some hypothesis class) with direct access to \emph{all} features, despite the fact that no single agent in the network has such access. We give upper and lower bounds for this problem for both linear and general hypothesis classes. Our results identify the \emph{depth} of the DAG as the key parameter: information aggregation can occur over sufficiently long paths in the DAG, assuming that all of the relevant features are well represented along the path, and there are distributions over which information aggregation cannot occur even in the linear case, and even in arbitrarily large DAGs that do not have sufficient depth (such as a hub-and-spokes topology in which the spoke vertices collectively see all the features). We complement our theoretical results with a comprehensive set of experiments.
♻ ☆ Early-Warning of Thunderstorm-Driven Power Outages with a Two-Stage Machine Learning Model
Thunderstorm-driven outages are difficult to predict because most storms do not cause damage, convective processes occur rapidly and chaotically, and the available public data are both noisy and incomplete. We develop a 24-48 h early-warning model for summer, thunderstorm-related outages in Michigan using only open sources (EAGLE-I for ground truth; METAR for weather). We use the publicly released EAGLE-I outage dataset (2014-2022), maintained by Oak Ridge National Laboratory for the U.S. Department of Energy. The pipeline preserves convective micro-signals from a sparse station network via parameter-specific kriging with hourly variograms and targeted overdrafting to retain extremes, and builds causal spatio-temporal features (lags/rolling statistics; k-NN/IDW spatial aggregates) capturing precursors of severe convection (moisture advection, wind shifts, and pressure drops). The two-stage model design, combining a logistic gate and an LSTM regressor, limits routine periods and reduces noise exposure. The study uses event-centric metrics (cluster-based hits/misses/false alarms) and peak-conditional MASE (cMASE) in +/-Delta-hour windows around state-level peaks (>= 50,000), with uncertainty quantified by hourly moving-block bootstrap. On the test sample, Two-Stage detects more reference peaks across all windows (e.g., at +/-48 h it records 3/4 vs. 2/4; F1 66.7% vs. 57.1%) with one extra false alarm. Near peaks, it shows modest amplitude gains (2-3% lower cMASE at +/-0-12 h; bootstrap medians +9-13% at +/-6-12 h) but small losses at +/-36-48 h (~3-4%). Overall, errors are comparable to the one-step LSTM baseline. SHAP analysis confirms moisture-advection and wind/gust precursors, underscoring the value of the feature engineering. Despite open-data noise, the feature-driven pipeline yields actionable, event-focused early warnings for thunderstorm outages.
comment: 23 pages (main), 70 pages incl. appendices; figures & tables as in manuscript. Code (main figure, synthetic data): https://github.com//peak-outage-forecasting License: CC BY 4.0 (preprint)
♻ ☆ Bures-Wasserstein Flow Matching for Graph Generation
Graph generation has emerged as a critical task in fields ranging from drug discovery to circuit design. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between reference and data distributions. However, these methods typically model the evolution of individual nodes and edges independently and use linear interpolations to build the path. This disentangled interpolation breaks the interconnected patterns of graphs, making the constructed probability path irregular and non-smooth, which causes poor training dynamics and faulty sampling convergence. To address the limitation, this paper first presents a theoretically grounded framework for probability path construction in graph generative models. Specifically, we model the joint evolution of the nodes and edges by representing graphs as connected systems parameterized by Markov random fields (MRF). We then leverage the optimal transport displacement between MRF objects to design a smooth probability path that ensures the co-evolution of graph components. Based on this, we introduce BWFlow, a flow-matching framework for graph generation that utilizes the derived optimal probability path to benefit the training and sampling algorithm design. Experimental evaluations in plain graph generation and molecule generation validate the effectiveness of BWFlow with competitive performance, better training convergence, and efficient sampling.
♻ ☆ Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning NeurIPS 2025
Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have explored training-free methods using first- and second-order statistics to aggregate local client data distributions at the server and achieve high performance without any training. In this work, we propose a training-free method based on an unbiased estimator of class covariance matrices which only uses first-order statistics in the form of class means communicated by clients to the server. We show how these estimated class covariances can be used to initialize the global classifier, thus exploiting the covariances without actually sharing them. We also show that using only within-class covariances results in a better classifier initialization. Our approach improves performance in the range of 4-26% with exactly the same communication cost when compared to methods sharing only class means and achieves performance competitive or superior to methods sharing second-order statistics with dramatically less communication overhead. The proposed method is much more communication-efficient than federated prompt-tuning methods and still outperforms them. Finally, using our method to initialize classifiers and then performing federated fine-tuning or linear probing again yields better performance. Code is available at https://github.com/dipamgoswami/FedCOF.
comment: Accepted at NeurIPS 2025
♻ ☆ CAFL-L: Constraint-Aware Federated Learning with Lagrangian Dual Optimization for On-Device Language Models NeurIPS
We introduce Constraint-Aware Federated Learning with Lagrangian Dual Optimization (CAFL-L), a principled extension of FedAvg that explicitly incorporates device-level resource constraints including energy, communication, memory, and thermal budgets. CAFL-L employs Lagrangian dual optimization to dynamically adapt training hyperparameters -- freezing depth, local steps, batch size, and communication compression -- while preserving training stability through token-budget preservation via gradient accumulation. Experiments on a character-level language model demonstrate that CAFL-L achieves superior constraint satisfaction compared to standard FedAvg (reducing memory usage by 20% and communication by 95%) while maintaining competitive validation performance, making it practical for deployment on resource-constrained edge devices.
comment: Accepted by 39th NeurIPS - Constrained Optimization for Machine Learning
♻ ☆ From Contextual Data to Newsvendor Decisions: On the Actual Performance of Data-Driven Algorithms
In this work, we study how the relevance/quality and quantity of past data influence performance by analyzing a contextual Newsvendor problem, in which a decision-maker trades off between underage and overage costs under uncertain demand. We consider a setting in which past demands observed under ``close by'' contexts come from close by distributions and analyze the performance of data-driven algorithms through a notion of context-dependent worst-case expected regret. We analyze the broad class of Weighted Empirical Risk Minimization (WERM) policies which weigh past data according to their similarity in the contextual space. This class includes classical policies such as ERM, k-Nearest Neighbors and kernel-based policies. Our main methodological contribution is to characterize exactly the worst-case regret of any WERM policy on any given configuration of contexts. To the best of our knowledge, this provides the first understanding of tight performance guarantees in any contextual decision-making problem, with past literature focusing on upper bounds via concentration inequalities. We instead take an optimization approach, and isolate a structure in the Newsvendor loss function that allows to reduce the infinite-dimensional optimization problem over worst-case distributions to a simple line search. This in turn allows us to unveil fundamental insights that were obfuscated by previous general-purpose bounds. We characterize actual guaranteed performance as a function of the contexts, as well as granular insights on the learning curve of algorithms.
♻ ☆ Aggregation on Learnable Manifolds for Asynchronous Federated Optimization AISTATS 26
Asynchronous federated learning (FL) with heterogeneous clients faces two key issues: curvature-induced loss barriers encountered by standard linear parameter interpolation techniques (e.g. FedAvg) and interference from stale updates misaligned with the server's current optimisation state. To alleviate these issues, we introduce a geometric framework that casts aggregation as curve learning in a Riemannian model space and decouples trajectory selection from update conflict resolution. Within this, we propose AsyncBezier, which replaces linear aggregation with low-degree polynomial (Bezier) trajectories to bypass loss barriers, and OrthoDC, which projects delayed updates via inner product-based orthogonality to reduce interference. We establish framework-level convergence guarantees covering each variant given simple assumptions on their components. On three datasets spanning general-purpose and healthcare domains, including LEAF Shakespeare and FEMNIST, our approach consistently improves accuracy and client fairness over strong asynchronous baselines; finally, we show that these gains are preserved even when other methods are allocated a higher local compute budget.
comment: 18 pages, 5 figures. [v3] Updated from technical report to condensed conference paper under review at AISTATS 26
♻ ☆ K-ASTRO: Structure-Aware Adaptation of LLMs for Code Vulnerability Detection
Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection-a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, limiting their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both efficiency and accuracy in code vulnerability detection. Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline to unify code semantics and syntax. Experimental results on three large-scale datasets, including BigVul, DiverseVul, and PrimeVul-demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, providing open-sourced tools to foster further research.
♻ ☆ Causal Additive Models with Unobserved Causal Paths and Backdoor Paths
Causal additive models provide a tractable yet expressive framework for causal discovery in the presence of hidden variables. However, when unobserved backdoor or causal paths exist between two variables, their causal relationship is often unidentifiable under existing theories. We establish sufficient conditions under which causal directions can be identified in many such cases. In particular, we derive conditions that enable identification of the parent-child relationship in a bow, an adjacent pair of observed variables sharing a hidden common parent. This represents a notoriously difficult case in causal discovery, and, to our knowledge, no prior work has established such identifiability in any causal model without imposing assumptions on the hidden variables. Our conditions rely on new characterizations of regression sets and a hybrid approach that combines independence among regression residuals with conditional independencies among observed variables. We further provide a sound and complete algorithm that incorporates these insights, and empirical evaluations demonstrate competitive performance with state-of-the-art methods.
comment: 23 pages
♻ ☆ Preference Discerning with LLM-Enhanced Generative Retrieval
In sequential recommendation, models recommend items based on user's interaction history. To this end, current models usually incorporate information such as item descriptions and user intent or preferences. User preferences are usually not explicitly given in open-source datasets, and thus need to be approximated, for example via large language models (LLMs). Current approaches leverage approximated user preferences only during training and rely solely on the past interaction history for recommendations, limiting their ability to dynamically adapt to changing preferences, potentially reinforcing echo chambers. To address this issue, we propose a new paradigm, namely preference discerning, which explicitly conditions a generative recommendation model on user preferences in natural language within its context. To evaluate preference discerning, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. Upon evaluating current state-of-the-art methods on our benchmark, we discover that their ability to dynamically adapt to evolving user preferences is limited. To address this, we propose a new method named Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce $\textbf{D}$iscern$\textbf{er}$), which achieves state-of-the-art performance in our benchmark. Our results show that Mender effectively adapts its recommendation guided by human preferences, even if not observed during training, paving the way toward more flexible recommendation models.
comment: Accepted at TMLR, Code available at https://github.com/facebookresearch/preference_discerning
♻ ☆ What's Inside Your Diffusion Model? A Score-Based Riemannian Metric to Explore the Data Manifold
Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold's contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
♻ ☆ From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work have highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .
♻ ☆ NIMO: a Nonlinear Interpretable MOdel
Deep learning has achieved remarkable success across many domains, but it has also created a growing demand for interpretability in model predictions. Although many explainable machine learning methods have been proposed, post-hoc explanations lack guaranteed fidelity and are sensitive to hyperparameter choices, highlighting the appeal of inherently interpretable models. For example, linear regression provides clear feature effects through its coefficients. However, such models are often outperformed by more complex neural networks (NNs) that usually lack inherent interpretability. To address this dilemma, we introduce NIMO, a framework that combines inherent interpretability with the expressive power of neural networks. Building on the simple linear regression, NIMO is able to provide flexible and intelligible feature effects. Relevantly, we develop an optimization method based on parameter elimination, that allows for optimizing the NN parameters and linear coefficients effectively and efficiently. By relying on adaptive ridge regression we can easily incorporate sparsity as well. We show empirically that our model can provide faithful and intelligible feature effects while maintaining good predictive performance.
♻ ☆ MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
♻ ☆ From Entity Reliability to Clean Feedback: An Entity-Aware Denoising Framework Beyond Interaction-Level Signals
Implicit feedback is central to modern recommender systems but is inherently noisy, often impairing model training and degrading user experience. At scale, such noise can mislead learning processes, reducing both recommendation accuracy and platform value. Existing denoising strategies typically overlook the entity-specific nature of noise while introducing high computational costs and complex hyperparameter tuning. To address these challenges, we propose \textbf{EARD} (\textbf{E}ntity-\textbf{A}ware \textbf{R}eliability-\textbf{D}riven Denoising), a lightweight framework that shifts the focus from interaction-level signals to entity-level reliability. Motivated by the empirical observation that training loss correlates with noise, EARD quantifies user and item reliability via their average training losses as a proxy for reputation, and integrates these entity-level factors with interaction-level confidence. The framework is \textbf{model-agnostic}, \textbf{computationally efficient}, and requires \textbf{only two intuitive hyperparameters}. Extensive experiments across multiple datasets and backbone models demonstrate that EARD yields substantial improvements over state-of-the-art baselines (e.g., up to 27.01\% gain in NDCG@50), while incurring negligible additional computational cost. Comprehensive ablation studies and mechanism analyses further confirm EARD's robustness to hyperparameter choices and its practical scalability. These results highlight the importance of entity-aware reliability modeling for denoising implicit feedback and pave the way for more robust recommendation research.
♻ ☆ Revisiting associative recall in modern recurrent models
Despite the advantageous subquadratic complexity of modern recurrent deep learning models -- such as state-space models (SSMs) -- recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one of such benchmarks: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token mixing strategies. We first demonstrate that, unlike standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect reported performance in previous works and suggests further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaling in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how components affects Transformer and Mamba's performance and optimization stability.
♻ ☆ CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models NeurIPS 2025
Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of both linearly and nonlinearly coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on https://kausable.github.io/CausalDynamics.
comment: NeurIPS 2025
♻ ☆ MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging
Simulating in silico cellular responses to interventions is a promising direction to accelerate high-content image-based assays, critical for advancing drug discovery and gene editing. To support this, we introduce MorphGen, a state-of-the-art diffusion-based generative model for fluorescent microscopy that enables controllable generation across multiple cell types and perturbations. To capture biologically meaningful patterns consistent with known cellular morphologies, MorphGen is trained with an alignment loss to match its representations to the phenotypic embeddings of OpenPhenom, a state-of-the-art biological foundation model. Unlike prior approaches that compress multichannel stains into RGB images -- thus sacrificing organelle-specific detail -- MorphGen generates the complete set of fluorescent channels jointly, preserving per-organelle structures and enabling a fine-grained morphological analysis that is essential for biological interpretation. We demonstrate biological consistency with real images via CellProfiler features, and MorphGen attains an FID score over 35% lower than the prior state-of-the-art MorphoDiff, which only generates RGB images for a single cell type. Code is available at https://github.com/czi-ai/MorphGen.
♻ ☆ AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of \textbf{implicit rewards}, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce \textbf{Adaptive Meta Fine-Tuning (AMFT)}, a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a \textbf{meta-gradient adaptive weight controller} that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrats superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.
comment: The paper is currently under investigation regarding concerns of potential academic misconduct. While the investigation is ongoing, the authors have voluntarily requested to withdraw the manuscript
♻ ☆ Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions
In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.
comment: 54 pages, 20 figures
♻ ☆ DQS: A Low-Budget Query Strategy for Enhancing Unsupervised Data-driven Anomaly Detection Approaches
Truly unsupervised approaches for time series anomaly detection are rare in the literature. Those that exist suffer from a poorly set threshold, which hampers detection performance, while others, despite claiming to be unsupervised, need to be calibrated using a labelled data subset, which is often not available in the real world. This work integrates active learning with an existing unsupervised anomaly detection method by selectively querying the labels of multivariate time series, which are then used to refine the threshold selection process. To achieve this, we introduce a novel query strategy called the dissimilarity-based query strategy (DQS). DQS aims to maximise the diversity of queried samples by evaluating the similarity between anomaly scores using dynamic time warping. We assess the detection performance of DQS in comparison to other query strategies and explore the impact of mislabelling, a topic that is underexplored in the literature. Our findings indicate that DQS performs best in small-budget scenarios, though the others appear to be more robust when faced with mislabelling. Therefore, in the real world, the choice of query strategy depends on the expertise of the oracle and the number of samples they are willing to label. Regardless, all query strategies outperform the unsupervised threshold even in the presence of mislabelling. Thus, whenever it is feasible to query an oracle, employing an active learning-based threshold is recommended.
comment: Submitted to the Information Sciences journal
♻ ☆ Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data
Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.
♻ ☆ Gradient-based Sample Selection for Faster Bayesian Optimization
Bayesian optimization (BO) is an effective technique for black-box optimization. However, its applicability is typically limited to moderate-budget problems due to the cubic complexity of fitting the Gaussian process (GP) surrogate model. In large-budget scenarios, directly employing the standard GP model faces significant challenges in computational time and resource requirements. In this paper, we propose a novel approach, gradient-based sample selection Bayesian Optimization (GSSBO), to enhance the computational efficiency of BO. The GP model is constructed on a selected set of samples instead of the whole dataset. These samples are selected by leveraging gradient information to remove redundancy while preserving diversity and representativeness. We provide a theoretical analysis of the gradient-based sample selection strategy and obtain explicit sublinear regret bounds for our proposed framework. Extensive experiments on synthetic and real-world tasks demonstrate that our approach significantly reduces the computational cost of GP fitting in BO while maintaining optimization performance comparable to baseline methods.
♻ ☆ Collaborative Unlabeled Data Optimization
This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94 \times $ and $1.2 \times$.
♻ ☆ Machine Learning Detection of Lithium Plating in Lithium-ion Cells: A Gaussian Process Approach
Lithium plating during fast charging is a critical degradation mechanism that accelerates capacity fade and can trigger catastrophic safety failures. Recent work has identified a distinctive dQ/dV peak above 4.0 V as a reliable signature of plating onset; however, conventional methods for computing dQ/dV rely on finite differencing with filtering, which amplifies sensor noise and introduces bias in peak location. In this paper, we propose a Gaussian Process (GP) framework for lithium plating detection by directly modeling the charge-voltage relationship Q(V) as a stochastic process with calibrated uncertainty. Leveraging the property that derivatives of GPs remain GPs, we infer dQ/dV analytically and probabilistically from the posterior, enabling robust detection without ad hoc smoothing. The framework provides three key benefits: (i) noise-aware inference with hyperparameters learned from data, (ii) closed-form derivatives with credible intervals for uncertainty quantification, and (iii) scalability to online variants suitable for embedded BMS. Experimental validation on Li-ion coin cells across a range of C-rates (0.2C-1C) and temperatures (0-40\deg C) demonstrates that the GP-based method reliably detects plating peaks under low-temperature, high-rate charging, while correctly reporting no peaks in baseline cases. The concurrence of GP-identified differential peaks, reduced charge throughput, and capacity fade measured via reference performance tests confirms the method's accuracy and robustness, establishing a practical pathway for real-time lithium plating detection.
comment: Submitted to American Control Conference 2026 - ACC 2026
♻ ☆ Multiparameter regularization and aggregation in the context of polynomial functional regression
Most of the recent results in polynomial functional regression have been focused on an in-depth exploration of single-parameter regularization schemes. In contrast, in this study we go beyond that framework by introducing an algorithm for multiple parameter regularization and presenting a theoretically grounded method for dealing with the associated parameters. This method facilitates the aggregation of models with varying regularization parameters. The efficacy of the proposed approach is assessed through evaluations on both synthetic and some real-world medical data, revealing promising results.
comment: 24 pages, to appear in Analysis and Applications
♻ ☆ Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives
The two-timescale gradient descent-ascent (GDA) is a canonical gradient algorithm designed to find Nash equilibria in min-max games. We analyze the two-timescale GDA by investigating the effects of learning rate ratios on convergence behavior in both finite-dimensional and mean-field settings. In particular, for finite-dimensional quadratic min-max games, we obtain long-time convergence in near quasi-static regimes through the hypocoercivity method. For mean-field GDA dynamics, we investigate convergence under a finite-scale ratio using a mixed synchronous-reflection coupling technique.
comment: v3: added analysis of preconditioned GDA for finite-dimensional case and improved results for mean-field case
♻ ☆ PrivATE: Differentially Private Confidence Intervals for Average Treatment Effects
The average treatment effect (ATE) is widely used to evaluate the effectiveness of drugs and other medical interventions. In safety-critical applications like medicine, reliable inferences about the ATE typically require valid uncertainty quantification, such as through confidence intervals (CIs). However, estimating treatment effects in these settings often involves sensitive data that must be kept private. In this work, we present PrivATE, a novel machine learning framework for computing CIs for the ATE under differential privacy. Specifically, we focus on deriving valid privacy-preserving CIs for the ATE from observational data. Our PrivATE framework consists of three steps: (i) estimating the differentially private ATE through output perturbation; (ii) estimating the differentially private variance in a doubly robust manner; and (iii) constructing the CIs while accounting for the uncertainty from both the estimation and privatization steps. Our PrivATE framework is model agnostic, doubly robust, and ensures valid CIs. We demonstrate the effectiveness of our framework using synthetic and real-world medical datasets. To the best of our knowledge, we are the first to derive a general, doubly robust framework for valid CIs of the ATE under ($\varepsilon,\delta$)-differential privacy.
♻ ☆ MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across vision-language tasks, yet their large-scale deployment raises pressing concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning approaches for MLLMs typically adapt training-based strategies such as gradient ascent or preference optimization, but these methods are computationally expensive, irreversible, and often distort retained knowledge. In this work, we propose MLLMEraser, an input-aware, training-free framework for test-time unlearning. Our approach leverages activation steering to enable dynamic knowledge erasure without parameter updates. Specifically, we construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs with knowledge-erasure counterparts, capturing both textual and visual discrepancies. To prevent unnecessary interference, we further design an input-aware steering mechanism that adaptively determines when and how the erasure direction should be applied, preserving utility on retained knowledge while enforcing forgetting on designated content. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.
♻ ☆ Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks.
comment: Accepted to the EMNLP 2025 Industry Track
♻ ☆ Efficient and Accurate Pneumonia Detection Using a Novel Multi-Scale Transformer Approach
Pneumonia, a prevalent respiratory infection, remains a leading cause of morbidity and mortality worldwide, particularly among vulnerable populations. Chest X-rays serve as a primary tool for pneumonia detection; however, variations in imaging conditions and subtle visual indicators complicate consistent interpretation. Automated tools can enhance traditional methods by improving diagnostic reliability and supporting clinical decision-making. In this study, we propose a novel multi-scale transformer approach for pneumonia detection that integrates lung segmentation and classification into a unified framework. Our method introduces a lightweight transformer-enhanced TransUNet for precise lung segmentation, achieving a Dice score of 95.68% on the "Chest X-ray Masks and Labels" dataset with fewer parameters than traditional transformers. For classification, we employ pre-trained ResNet models (ResNet-50 and ResNet-101) to extract multi-scale feature maps, which are then processed through a modified transformer module to enhance pneumonia detection. This integration of multi-scale feature extraction and lightweight transformer modules ensures robust performance, making our method suitable for resource-constrained clinical environments. Our approach achieves 93.75% accuracy on the "Kermany" dataset and 96.04% accuracy on the "Cohen" dataset, outperforming existing methods while maintaining computational efficiency. This work demonstrates the potential of multi-scale transformer architectures to improve pneumonia diagnosis, offering a scalable and accurate solution to global healthcare challenges. https://github.com/amirrezafateh/Multi-Scale-Transformer-Pneumonia
♻ ☆ Hierarchical autoregressive neural networks in three-dimensional statistical system
Autoregressive Neural Networks (ANN) have been recently proposed as a mechanism to improve the efficiency of Monte Carlo algorithms for several spin systems. The idea relies on the fact that the total probability of a configuration can be factorized into conditional probabilities of each spin, which in turn can be approximated by a neural network. Once trained, the ANNs can be used to sample configurations from the approximated probability distribution and to explicitly evaluate this probability for a given configuration. It has also been observed that such conditional probabilities give access to information-theoretic observables such as mutual information or entanglement entropy. In this paper, we describe the hierarchical autoregressive network (HAN) algorithm in three spatial dimensions and study its performance using the example of the Ising model. We compare HAN with three other autoregressive architectures and the classical Wolff cluster algorithm. Finally, we provide estimates of thermodynamic observables for the three-dimensional Ising model, such as entropy and free energy, in a range of temperatures across the phase transition.
comment: 19 pages, 8 figures
♻ ☆ Robustness in Both Domains: CLIP Needs a Robust Text Encoder NeurIPS 2025
Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been done towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we cover this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization. We open-source our code ( https://github.com/LIONS-EPFL/LEAF ) and models ( https://huggingface.co/LEAF-CLIP ).
comment: Accepted in NeurIPS 2025
♻ ☆ Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.
comment: 20 pages, 10 figures
♻ ☆ Multi-Scale Finetuning for Encoder-based Time Series Foundation Models NeurIPS 2025
Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. However, an important yet underexplored challenge is how to effectively finetune TSFMs on specific downstream tasks. While naive finetuning can yield performance gains, we argue that it falls short of fully leveraging TSFMs' capabilities, often resulting in overfitting and suboptimal performance. Given the diverse temporal patterns across sampling scales and the inherent multi-scale forecasting capabilities of TSFMs, we adopt a causal perspective to analyze finetuning process, through which we highlight the critical importance of explicitly modeling multiple scales and reveal the shortcomings of naive approaches. Focusing on encoder-based TSFMs, we propose Multiscale finetuning (MSFT), a simple yet general framework that explicitly integrates multi-scale modeling into the finetuning process. Experimental results on three different backbones (Moirai, Moment and Units) demonstrate that TSFMs finetuned with MSFT not only outperform naive and typical parameter efficient finetuning methods but also surpass state-of-the-art deep learning methods. Codes are available at https://github.com/zqiao11/MSFT.
comment: Accepted by NeurIPS 2025
♻ ☆ Epistemic Errors of Imperfect Multitask Learners When Distributions Shift
Uncertainty-aware machine learners, such as Bayesian neural networks, output a quantification of uncertainty instead of a point prediction. In this work, we provide uncertainty-aware learners with a principled framework to characterize, and identify ways to eliminate, errors that arise from reducible (epistemic) uncertainty. We introduce a principled definition of epistemic error, and provide a decompositional epistemic error bound which operates in the very general setting of imperfect multitask learning under distribution shift. In this setting, the training (source) data may arise from multiple tasks, the test (target) data may differ systematically from the source data tasks, and/or the learner may not arrive at an accurate characterization of the source data. Our bound separately attributes epistemic errors to each of multiple aspects of the learning procedure and environment. As corollaries of the general result, we provide epistemic error bounds specialized to the settings of Bayesian transfer learning and distribution shift within $\epsilon$-neighborhoods. We additionally leverage the terms in our bound to provide a novel definition of negative transfer.
♻ ☆ ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection
Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
comment: 4 pages, 1 reference, 3 figures
♻ ☆ System Prompt Optimization with Meta-Learning NeurIPS 2025
Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
comment: NeurIPS 2025
♻ ☆ Understanding and Improving Information Preservation in Prompt Compression for LLMs
Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Using our framework, we analyze state-of-the-art soft and hard compression methods and show that some fail to preserve key details from the original prompt, limiting performance on complex tasks. By identifying these limitations, we are able to improve one soft prompting method by controlling compression granularity, achieving up to +23% in downstream performance, +8 BERTScore points in grounding, and 2.7x more entities preserved in compression. Ultimately, we find that the best effectiveness/compression rate trade-off is achieved with soft prompting combined with sequence-level training.The code is available at https://github.com/amazon-science/information-preservation-in-prompt-compression.
comment: Accepted to EMNLP 2025 (Findings), 22 pages, 6 figures, 24 tables
♻ ☆ Transferable Parasitic Estimation via Graph Contrastive Learning and Label Rebalancing in AMS Circuits
Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with the $R^2$ improvement of $33.64\% \sim 44.20\%$ for edge regression and F1-score gain of $0.9\times \sim 2.1\times$ for node classification. Our code is available at https://github.com/ShenShan123/CircuitGCL.
comment: Final version accepted by the International Conference on Computer-Aided Design (ICCAD) 2025. First two authors have equal contributions
♻ ☆ Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models
In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
♻ ☆ Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo ICML 2025
A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data.
comment: Accepted to ICML 2025, official PMLR proceedings can be found at https://proceedings.mlr.press/v267/ekstrom-kelvinius25b.html. Code available at https://github.com/filipekstrm/ddsmc
♻ ☆ WyckoffDiff -- A Generative Diffusion Model for Crystal Symmetry ICML 2025
Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fr\'echet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation. As a proof-of-concept study, we use WyckoffDiff to find new materials below the convex hull of thermodynamical stability.
comment: Accepted to ICML 2025, official PMLR proceedings can be found at https://proceedings.mlr.press/v267/ekstrom-kelvinius25a.html. Code is available online at https://github.com/httk/wyckoffdiff
♻ ☆ Robust LLM Training Infrastructure at ByteDance
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
♻ ☆ Towards Natural Machine Unlearning
Machine unlearning (MU) aims to eliminate information that has been learned from specific training data, namely forgetting data, from a pre-trained model. Currently, the mainstream of existing MU methods involves modifying the forgetting data with incorrect labels and subsequently fine-tuning the model. While learning such incorrect information can indeed remove knowledge, the process is quite unnatural as the unlearning process undesirably reinforces the incorrect information and leads to over-forgetting. Towards more \textit{natural} machine unlearning, we inject correct information from the remaining data to the forgetting samples when changing their labels. Through pairing these adjusted samples with their labels, the model will tend to use the injected correct information and naturally suppress the information meant to be forgotten. Albeit straightforward, such a first step towards natural machine unlearning can significantly outperform current state-of-the-art approaches. In particular, our method substantially reduces the over-forgetting and leads to strong robustness to hyperparameters, making it a promising candidate for practical machine unlearning.
comment: Accepted by IEEE TPAMI
♻ ☆ TriP-LLM: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection
Time-series anomaly detection plays a central role across a wide range of application domains. With the increasing proliferation of the Internet of Things (IoT) and smart manufacturing, time-series data has dramatically increased in both scale and dimensionality. This growth has exposed the limitations of traditional statistical methods in handling the high heterogeneity and complexity of such data. Inspired by the recent success of large language models (LLMs) in multimodal tasks across language and vision domains, we propose a novel unsupervised anomaly detection framework: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection (TriP-LLM). TriP-LLM integrates local and global temporal features through a triple-branch design comprising Patching, Selecting, and Global modules, to encode the input time-series into patch-wise representations, which are then processed by a frozen, pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from which anomaly scores are derived. We evaluate TriP-LLM on several public benchmark datasets using PATE, a recently proposed threshold-free evaluation metric, and conduct all comparisons within a unified open-source framework to ensure fairness. Experimental results show that TriP-LLM consistently outperforms recent state-of-the-art (SOTA) methods across all datasets, demonstrating strong detection capabilities. Furthermore, through extensive ablation studies, we verify the substantial contribution of the LLM to the overall architecture. Compared to LLM-based approaches using Channel Independence (CI) patch processing, TriP-LLM achieves significantly lower memory consumption, making it more suitable for GPU memory-constrained environments. All code and model checkpoints of TriP-LLM are publicly available on https://github.com/YYZStart/TriP-LLM.git
comment: Accepted version of the paper published in IEEE Access (2025). Licensed under a Creative Commons Attribution 4.0 License (CC BY 4.0). Published version available at IEEE Xplore
♻ ☆ Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training
Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, exposing gradients during training can leak sensitive information about the underlying data, raising privacy and security concerns such as susceptibility to data poisoning attacks. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide non-vacuous generalization bounds and strong theoretical guarantees for differential privacy, robustness to data poisoning attacks, and extraction attacks. In experiments with LLMs, we demonstrate empirically that black-box optimization methods-despite the scalability and computational challenges inherent to black-box approaches-are able to learn, showing how a few iterations of BBoxER improve performance, generalize well on a benchmark of reasoning datasets, and are robust to membership inference attacks. This positions BBoxER as an attractive add-on on top of gradient-based optimization, offering suitability for deployment in restricted or privacy-sensitive environments while also providing non-vacuous generalization guarantees.
♻ ☆ SolNet: Open-source deep learning models for photovoltaic power forecasting across the globe
Deep learning models have gained increasing prominence in recent years in the field of solar pho-tovoltaic (PV) forecasting. One drawback of these models is that they require a lot of high-quality data to perform well. This is often infeasible in practice, due to poor measurement infrastructure in legacy systems and the rapid build-up of new solar systems across the world. This paper proposes SolNet: a novel, general-purpose, multivariate solar power forecaster, which addresses these challenges by using a two-step forecasting pipeline which incorporates transfer learning from abundant synthetic data generated from PVGIS, before fine-tuning on observational data. Using actual production data from hundreds of sites in the Netherlands, Australia and Belgium, we show that SolNet improves forecasting performance over data-scarce settings as well as baseline models. We find transfer learning benefits to be the strongest when only limited observational data is available. At the same time we provide several guidelines and considerations for transfer learning practitioners, as our results show that weather data, seasonal patterns, amount of synthetic data and possible mis-specification in source location, can have a major impact on the results. The SolNet models created in this way are applicable for any land-based solar photovoltaic system across the planet where simulated and observed data can be combined to obtain improved forecasting capabilities.
comment: 24 pages, 5 figures
♻ ☆ VAGPO: Vision-augmented Asymmetric Group Preference Optimization for Graph Routing Problems
Graph routing problems play a vital role in web-related networks, where finding optimal paths across graphs is essential for efficient data transmission and content delivery. Classic routing formulations such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) represent fundamental graph optimization challenges. Recent data-driven optimization methods have made significant progress, yet they often face limitations in training efficiency and generalization to large-scale instances. In this paper, we propose a novel Vision-augmented Asymmetric Group Preference Optimization (VAGPO) approach. By leveraging ResNet-based visual encoding and Transformer-based sequential modeling, VAGPO captures both spatial structure and temporal dependencies. Furthermore, we introduce an asymmetric group preference optimization strategy that significantly accelerates convergence compared to commonly used policy gradient methods. Experimental results on generated TSP and CVRP instances, as well as real-world datasets, demonstrate that the proposed VAGPO approach achieves highly competitive solution quality. Additionally, VAGPO exhibits strong generalization to larger instances (up to 1000 nodes) without re-training, highlighting its effectiveness in both learning efficiency and scalability.
♻ ☆ Towards the Training of Deeper Predictive Coding Neural Networks
Predictive coding networks are neural models that perform inference through an iterative energy minimization process, whose operations are local in space and time. While effective in shallow architectures, they suffer significant performance degradation beyond five to seven layers. In this work, we show that this degradation is caused by exponentially imbalanced errors between layers during weight updates, and by predictions from the previous layers not being effective in guiding updates in deeper layers. Furthermore, when training models with skip connections, the energy propagated by the residuals reaches higher layers faster than that propagated by the main pathway, affecting test accuracy. We address the first issue by introducing a novel precision-weighted optimization of latent variables that balances error distributions during the relaxation phase, the second issue by proposing a novel weight update mechanism that reduces error accumulation in deeper layers, and the third one by using auxiliary neurons that slow down the propagation of the energy in the residual connections. Empirically, our methods achieve performance comparable to backpropagation on deep models such as ResNets, opening new possibilities for predictive coding in complex tasks.
comment: 18 Pages, 7 figures
♻ ☆ Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning in vLLM are applied (Magnitude, Wanda, and SparseGPT), it consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
♻ ☆ Learning Robust Diffusion Models from Imprecise Supervision
Conditional diffusion models have achieved remarkable success in various generative tasks recently, but their training typically relies on large-scale datasets that inevitably contain imprecise information in conditional inputs. Such supervision, often stemming from noisy, ambiguous, or incomplete labels, will cause condition mismatch and degrade generation quality. To address this challenge, we propose DMIS, a unified framework for training robust Diffusion Models from Imprecise Supervision, which is the first systematic study within diffusion models. Our framework is derived from likelihood maximization and decomposes the objective into generative and classification components: the generative component models imprecise-label distributions, while the classification component leverages a diffusion classifier to infer class-posterior probabilities, with its efficiency further improved by an optimized timestep sampling strategy. Extensive experiments on diverse forms of imprecise supervision, covering tasks of image generation, weakly supervised learning, and noisy dataset condensation demonstrate that DMIS consistently produces high-quality and class-discriminative samples.
♻ ☆ G$^2$RPO: Granular GRPO for Precise Reward in Flow Models
The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G$^2$RPO significantly outperforms existing flow-based GRPO baselines,highlighting its effectiveness and robustness.
comment: Project Page: https://bujiazi.github.io/g2rpo.github.io/
♻ ☆ Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data NeurIPS 2025
Causality in time series can be challenging to determine, especially in the presence of non-linear dependencies. Granger causality helps analyze potential relationships between variables, thereby offering a method to determine whether one time series can predict-Granger cause-future values of another. Although successful, Granger causal methods still struggle with capturing long-range relations between variables. To this end, we leverage the recently successful Extended Long Short-Term Memory (xLSTM) architecture and propose Granger causal xLSTMs (GC-xLSTM). It first enforces sparsity between the time series components by using a novel dynamic loss penalty on the initial projection. Specifically, we adaptively improve the model and identify sparsity candidates. Our joint optimization procedure then ensures that the Granger causal relations are recovered robustly. Our experimental evaluation on six diverse datasets demonstrates the overall efficacy of GC-xLSTM.
comment: Accepted at NeurIPS 2025
♻ ☆ Unsupervised Multi-Source Federated Domain Adaptation under Domain Diversity through Group-Wise Discrepancy Minimization
Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
♻ ☆ Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions
In reinforcement learning, off-policy actor-critic methods like DDPG and TD3 use deterministic policy gradients: the Q-function is learned from environment data, while the actor maximizes it via gradient ascent. We observe that in complex tasks such as dexterous manipulation and restricted locomotion with mobility constraints, the Q-function exhibits many local optima, making gradient ascent prone to getting stuck. To address this, we introduce SAVO, an actor architecture that (i) generates multiple action proposals and selects the one with the highest Q-value, and (ii) approximates the Q-function repeatedly by truncating poor local optima to guide gradient ascent more effectively. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds optimal actions more frequently and outperforms alternate actor architectures.
comment: Outstanding Paper Award on Empirical Reinforcement Learning Research, RLC 2025
♻ ☆ Accelerated Evolving Set Processes for Local PageRank Computation
This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2), \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector, where $m$ denotes the number of edges in the graph and $R$ is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only $\tilde{\mathcal{O}}(1/\sqrt{\alpha})$ such linear systems, where $\alpha$ is the damping factor. When $1/\epsilon^2\ll m$, this implies the existence of an algorithm that computes an $\ epsilon $-approximation of the PPR vector with an overall time complexity of $\tilde{\mathcal{O}}\left(R^2 / (\sqrt{\alpha}\epsilon^2)\right)$, independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
♻ ☆ Sequential Monte Carlo for Policy Optimization in Continuous POMDPs NeurIPS 2025
Optimal decision-making under partial observability requires agents to balance reducing uncertainty (exploration) against pursuing immediate objectives (exploitation). In this paper, we introduce a novel policy optimization framework for continuous partially observable Markov decision processes (POMDPs) that explicitly addresses this challenge. Our method casts policy learning as probabilistic inference in a non-Markovian Feynman--Kac model that inherently captures the value of information gathering by anticipating future observations, without requiring suboptimal approximations or handcrafted heuristics. To optimize policies under this model, we develop a nested sequential Monte Carlo (SMC) algorithm that efficiently estimates a history-dependent policy gradient under samples from the optimal trajectory distribution induced by the POMDP. We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks, where existing methods struggle to act under uncertainty.
comment: Accepted at NeurIPS 2025
♻ ☆ ROC-n-reroll: How verifier imperfection affects test-time scaling
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance -- a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and LLama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
comment: 45 pages, 10 Figures
♻ ☆ Detecting Multilevel Manipulation from Limit Order Book via Cascaded Contrastive Representation Learning
Trade-based manipulation (TBM) undermines the fairness and stability of financial markets drastically. Spoofing, one of the most covert and deceptive TBM strategies, exhibits complex anomaly patterns across multilevel prices, while often being simplified as a single-level manipulation. These patterns are usually concealed within the rich, hierarchical information of the Limit Order Book (LOB), which is challenging to leverage due to high dimensionality and noise. To address this, we propose a representation learning framework combining a cascaded LOB representation architecture with supervised contrastive learning. Extensive experiments demonstrate that our framework consistently improves detection performance across diverse models, with Transformer-based architectures achieving state-of-the-art results. In addition, we conduct systematic analyses and ablation studies to investigate multilevel manipulation and the contributions of key components for detection, offering broader insights into representation learning and anomaly detection for complex time series data.
♻ ☆ HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions
We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a "delta" PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.
♻ ☆ Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance
Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
comment: EMNLP 2025
♻ ☆ DeepOHeat-v1: Efficient Operator Learning for Fast and Trustworthy Thermal Simulation and Optimization in 3D-IC Design
Thermal analysis is crucial in 3D-IC design due to increased power density and complex heat dissipation paths. Although operator learning frameworks such as DeepOHeat~\cite{liu2023deepoheat} have demonstrated promising preliminary results in accelerating thermal simulation, they face critical limitations in prediction capability for multi-scale thermal patterns, training efficiency, and trustworthiness of results during design optimization. This paper presents DeepOHeat-v1, an enhanced physics-informed operator learning framework that addresses these challenges through three key innovations. First, we integrate Kolmogorov-Arnold Networks with learnable activation functions as trunk networks, enabling an adaptive representation of multi-scale thermal patterns. This approach achieves a 1.25x and 6.29x reduction in error in two representative test cases. Second, we introduce a separable training method that decomposes the basis function along the coordinate axes, achieving 62x training speedup and 31x GPU memory reduction in our baseline case, and enabling thermal analysis at resolutions previously infeasible due to GPU memory constraints. Third, we propose a confidence score to evaluate the trustworthiness of the predicted results, and further develop a hybrid optimization workflow that combines operator learning with finite difference (FD) using Generalized Minimal Residual (GMRES) method for incremental solution refinement, enabling efficient and trustworthy thermal optimization. Experimental results demonstrate that DeepOHeat-v1 achieves accuracy comparable to optimization using high-fidelity finite difference solvers, while speeding up the entire optimization process by $70.6\times$ in our test cases, effectively minimizing the peak temperature through optimal placement of heat-generating components. Open source code is available at https://github.com/xlyu0127/DeepOHeat-v1.
comment: Published by IEEE TCPMT
♻ ☆ Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
State-of-The-Art (SoTA) image captioning models are often trained on the MicroSoft Common Objects in Context (MS-COCO) dataset, which contains human-annotated captions with an average length of approximately ten tokens. Although effective for general scene understanding, these short captions often fail to capture complex scenes and convey detailed information. Moreover, captioning models tend to exhibit bias towards the ``average'' caption, which captures only the more general aspects, thus overlooking finer details. In this paper, we present a novel approach to generate richer and more informative image captions by combining the captions generated from different SoTA captioning models. Our proposed method requires no additional model training: given an image, it leverages pre-trained models from the literature to generate the initial captions, and then ranks them using a newly introduced image-text-based metric, which we name BLIPScore. Subsequently, the top two captions are fused using a Large Language Model (LLM) to produce the final, more detailed description. Experimental results on the MS-COCO and Flickr30k test sets demonstrate the effectiveness of our approach in terms of caption-image alignment and hallucination reduction according to the ALOHa, CAPTURE, and Polos metrics. A subjective study lends additional support to these results, suggesting that the captions produced by our model are generally perceived as more consistent with human judgment. By combining the strengths of diverse SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich and informative nature of human-generated descriptions. This advance enables the generation of more suitable captions for the training of both vision-language and captioning models.
comment: This manuscript has been accepted for publication in Springer Neural Computing and Applications
♻ ☆ Post-training quantization of vision encoders needs prefixing registers
Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
♻ ☆ On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from the expert, which promotes on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on mathematical reasoning problems and practical tool-use tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
♻ ☆ Maximizing UAV Cellular Connectivity with Reinforcement Learning for BVLoS Path Planning
This paper presents a reinforcement learning (RL) based approach for path planning of cellular connected unmanned aerial vehicles (UAVs) operating beyond visual line of sight (BVLoS). The objective is to minimize travel distance while maximizing the quality of cellular link connectivity by considering real world aerial coverage constraints and employing an empirical aerial channel model. The proposed solution employs RL techniques to train an agent, using the quality of communication links between the UAV and base stations (BSs) as the reward function. Simulation results demonstrate the effectiveness of the proposed method in training the agent and generating feasible UAV path plans. The proposed approach addresses the challenges due to limitations in UAV cellular communications, highlighting the need for investigations and considerations in this area. The RL algorithm efficiently identifies optimal paths, ensuring maximum connectivity with ground BSs to ensure safe and reliable BVLoS flight operation. Moreover, the solution can be deployed as an offline path planning module that can be integrated into future ground control systems (GCS) for UAV operations, enhancing their capabilities and safety. The method holds potential for complex long range UAV applications, advancing the technology in the field of cellular connected UAV path planning.
comment: Submitted to an IEEE Conference
♻ ☆ Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling
The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT) fail to instill robust reasoning, and traditional Reinforcement Learning (RL) struggles with inefficient exploration. We propose \textbf{EGPO}, a new RL framework built upon Group Relative Policy Optimization (GRPO), designed to address this challenge directly. The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient computation. This encourages the generation of diverse reasoning strategies. To maintain optimization direction, the entropy bonus is carefully constrained by a clipping mechanism. Complemented by a strict, binary reward signal, EGPO effectively guides the model towards discovering structured and accurate tool invocation patterns. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size, surpassing a range of strong competitors, including GPT-4o and Gemini-2.5.
♻ ☆ Synthetic Series-Symbol Data Generation for Time Series Foundation Models NeurIPS 2025
Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tunes with downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
comment: 57 pages, 25 figures, 35 tables, NeurIPS 2025 accepted
♻ ☆ Interpretable Machine Learning for Urban Heat Mitigation: Attribution and Weighting of Multi-Scale Drivers
Urban heat islands (UHIs) are often accentuated during heat waves (HWs) and pose a public health risk. Mitigating UHIs requires urban planners to first estimate how urban heat is influenced by different land use types (LUTs) and drivers across scales - from synoptic-scale climatic background processes to small-scale urban- and scale-bridging features. This study proposes to classify these drivers into driving (D), urban (U), and local (L) features, respectively. To increase interpretability and enhance computation efficiency, a LUT-distinguishing machine learning approach is proposed as a fast emulator for Weather Research and Forecasting model (WRF) coupled to the Noah land surface model (LSM) to predict ground- (TSK) and 2-meter air temperature (T2). Using random forest regression (RFR) with extreme gradient boosting (XGB) trained on WRF output over Zurich, Switzerland, during heatwave (HW) periods in 2017 and 2019, this study proposes LUT-based (LB) models that categorize features by scales and practical controllability, allowing optional categorical weighting. This approach enables category-specific feature ranking and sensitivity estimation of T2 and TSK to most important small-scale drivers - most notably surface emissivity, albedo, and leaf area index (LAI). Models employing the LB framework are statistically significantly more accurate than models that do not, with higher performance when more HW data is included in training. With RFR-XGB robustly performing optimal with unit weights, the method substantially increase interpretability. Despite the needs to reduce uncertainties and test the method on other cities, the proposed approach offers urban planners a direct framework for feasibility-centered UHI mitigation assessment.
♻ ☆ Explaining Models under Multivariate Bernoulli Distribution via Hoeffding Decomposition
Explaining the behavior of predictive models with random inputs can be achieved through sub-models decomposition, where such sub-models have easier interpretable features. Arising from the uncertainty quantification community, recent results have demonstrated the existence and uniqueness of a generalized Hoeffding decomposition for such predictive models when the stochastic input variables are correlated, based on concepts of oblique projection onto L 2 subspaces. This article focuses on the case where the input variables have Bernoulli distributions and provides a complete description of this decomposition. We show that in this case the underlying L 2 subspaces are one-dimensional and that the functional decomposition is explicit. This leads to a complete interpretability framework and theoretically allows reverse engineering. Explicit indicators of the influence of inputs on the output prediction (exemplified by Sobol' indices and Shapley effects) can be explicitly derived. Illustrated by numerical experiments, this type of analysis proves useful for addressing decision-support problems, based on binary decision diagrams, Boolean networks or binary neural networks. The article outlines perspectives for exploring high-dimensional settings and, beyond the case of binary inputs, extending these findings to models with finite countable inputs.
♻ ☆ Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm NeurIPS 2025
Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often lead to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery, policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.
comment: Accepted to NeurIPS 2025. Title used at submission and review: PIRO: Toward Stable Reward Learning for Inverse RL via Monotonic Policy Divergence Reduction
♻ ☆ AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking in Large Language Models
LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work 'well enough' across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only a few-shot guide. AdaReasoner is backed by theoretical guarantees and experiments of fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yield gains on knowledge-intensive tasks through tailored prompts.
♻ ☆ Latent Variable Modeling in Multi-Agent Reinforcement Learning via Expectation-Maximization for UAV-Based Wildlife Protection
Protecting endangered wildlife from illegal poaching presents a critical challenge, particularly in vast and partially observable environments where real-time response is essential. This paper introduces a novel Expectation-Maximization (EM) based latent variable modeling approach in the context of Multi-Agent Reinforcement Learning (MARL) for Unmanned Aerial Vehicle (UAV) coordination in wildlife protection. By modeling hidden environmental factors and inter-agent dynamics through latent variables, our method enhances exploration and coordination under uncertainty.We implement and evaluate our EM-MARL framework using a custom simulation involving 10 UAVs tasked with patrolling protected habitats of the endangered Iranian leopard. Extensive experimental results demonstrate superior performance in detection accuracy, adaptability, and policy convergence when compared to standard algorithms such as Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). Our findings underscore the potential of combining EM inference with MARL to improve decentralized decisionmaking in complex, high-stakes conservation scenarios. The full implementation, simulation environment, and training scripts are publicly available on GitHub.
♻ ☆ MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation
Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information-two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design. Code are available at \href{https://github.com/Eurekashen/MolSpectLLM}{https://github.com/Eurekashen/MolSpectLLM}.
♻ ☆ 6G-Enabled Digital Twin Framework for Real-Time Cyber-Physical Systems: An Experimental Validation with Industrial Bearing Fault Detection
Current Cyber-Physical Systems (CPS) integrated with Digital Twin (DT) technology face critical limitations in achieving real-time performance for mission-critical industrial applications. Existing 5G-enabled systems suffer from latencies exceeding 10ms, which are inadequate for applications requiring sub-millisecond response times, such as autonomous industrial control and predictive maintenance. This research aims to develop and validate a 6G-enabled Digital Twin framework that achieves ultra-low latency communication and real-time synchronization between physical industrial assets and their digital counterparts, specifically targeting bearing fault detection as a critical industrial use case. The proposed framework integrates terahertz communications (0.1-1 THz), intelligent reflecting surfaces, and edge artificial intelligence within a five-layer architecture. Experimental validation was conducted using the Case Western Reserve University (CWRU) bearing dataset, implementing comprehensive feature extraction (15 time and frequency domain features) and Random Forest classification algorithms. The system performance was evaluated against traditional WiFi-6 and 5G networks across multiple metrics, including classification accuracy, end-to-end latency, and scalability. It achieved 97.7% fault classification accuracy with 0.8ms end-to-end latency, representing a 15.6x improvement over WiFi-6 (12.5ms) and 5.25x improvement over 5G (4.2ms) networks. The system demonstrated superior scalability with sub-linear processing time growth and maintained consistent performance across four bearing fault categories (normal, inner race, outer race, and ball faults) with macro-averaged F1-scores exceeding 97%.
comment: We no longer stand by the results as presented and prefer to retract the work publicly rather than allow continued citation of an invalid claim. We apologize for any inconvenience to the community. A corrected or substantially revised version may be submitted later under a new identifier
♻ ☆ Direct Quantized Training of Language Models with Stochastic Rounding ACML 2025
Although recent quantized Large Language Models (LLMs), such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
comment: Accepted to ACML 2025
♻ ☆ Network Dynamics-Based Framework for Understanding Deep Neural Networks
Advancements in artificial intelligence call for a deeper understanding of the fundamental mechanisms underlying deep learning. In this work, we propose a theoretical framework to analyze learning dynamics through the lens of dynamical systems theory. We redefine the notions of linearity and nonlinearity in neural networks by introducing two fundamental transformation units at the neuron level: order-preserving transformations and non-order-preserving transformations. Different transformation modes lead to distinct collective behaviors in weight vector organization, different modes of information extraction, and the emergence of qualitatively different learning phases. Transitions between these phases may occur during training, accounting for key phenomena such as grokking. To further characterize generalization and structural stability, we introduce the concept of attraction basins in both sample and weight spaces. The distribution of neurons with different transformation modes across layers, along with the structural characteristics of the two types of attraction basins, forms a set of core metrics for analyzing the performance of learning models. Hyperparameters such as depth, width, learning rate, and batch size act as control variables for fine-tuning these metrics. Our framework not only sheds light on the intrinsic advantages of deep learning, but also provides a novel perspective for optimizing network architectures and training strategies.
comment: 13 pages, 7 figures
♻ ☆ Prototype Learning to Create Refined Interpretable Digital Phenotypes from ECGs
Prototype-based neural networks offer interpretable predictions by comparing inputs to learned, representative signal patterns anchored in training data. While such models have shown promise in the classification of physiological data, it remains unclear whether their prototypes capture an underlying structure that aligns with broader clinical phenotypes. We use a prototype-based deep learning model trained for multi-label ECG classification using the PTB-XL dataset. Then without modification we performed inference on the MIMIC-IV clinical database. We assess whether individual prototypes, trained solely for classification, are associated with hospital discharge diagnoses in the form of phecodes in this external population. Individual prototypes demonstrate significantly stronger and more specific associations with clinical outcomes compared to the classifier's class predictions, NLP-extracted concepts, or broader prototype classes across all phecode categories. Prototype classes with mixed significance patterns exhibit significantly greater intra-class distances (p $<$ 0.0001), indicating the model learned to differentiate clinically meaningful variations within diagnostic categories. The prototypes achieve strong predictive performance across diverse conditions, with AUCs ranging from 0.89 for atrial fibrillation to 0.91 for heart failure, while also showing substantial signal for non-cardiac conditions such as sepsis and renal disease. These findings suggest that prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.
comment: Accepted (oral) to the 31st Pacific Symposium on Biocomputing
♻ ☆ Prompt-Aware Scheduling for Low-Latency LLM Serving
Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
♻ ☆ Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints
We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models(LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
♻ ☆ Symmetry in Neural Network Parameter Spaces
Modern deep learning models are highly overparameterized, resulting in large sets of parameter configurations that yield the same outputs. A significant portion of this redundancy is explained by symmetries in the parameter space--transformations that leave the network function unchanged. These symmetries shape the loss landscape and constrain learning dynamics, offering a new lens for understanding optimization, generalization, and model complexity that complements existing theory of deep learning. This survey provides an overview of parameter space symmetry. We summarize existing literature, uncover connections between symmetry and learning theory, and identify gaps and opportunities in this emerging field.
comment: 30 pages, 9 figures
♻ ☆ An Imitative Reinforcement Learning Framework for Pursuit-Lock-Launch Missions
Unmanned Combat Aerial Vehicle (UCAV) Within-Visual-Range (WVR) engagement, referring to a fight between two or more UCAVs at close quarters, plays a decisive role on the aerial battlefields. With the development of artificial intelligence, WVR engagement progressively advances towards intelligent and autonomous modes. However, autonomous WVR engagement policy learning is hindered by challenges such as weak exploration capabilities, low learning efficiency, and unrealistic simulated environments. To overcome these challenges, we propose a novel imitative reinforcement learning framework, which efficiently leverages expert data while enabling autonomous exploration. The proposed framework not only enhances learning efficiency through expert imitation, but also ensures adaptability to dynamic environments via autonomous exploration with reinforcement learning. Therefore, the proposed framework can learn a successful policy of `pursuit-lock-launch' for UCAVs. To support data-driven learning, we establish an environment based on the Harfang3D sandbox. The extensive experiment results indicate that the proposed framework excels in this multistage task, and significantly outperforms state-of-the-art reinforcement learning and imitation learning methods. Thanks to the ability of imitating experts and autonomous exploration, our framework can quickly learn the critical knowledge in complex aerial combat tasks, achieving up to a 100% success rate and demonstrating excellent robustness.
♻ ☆ SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in iterative, context-rich workflows characteristic of SE activities. To address this limitation, we introduce \emph{SWE-Arena}, an interactive platform designed to evaluate FMs in SE tasks. SWE-Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. The platform introduces novel metrics, including \emph{model consistency score} that measures the consistency of model outputs through self-play matches, and \emph{conversation efficiency index} that evaluates model performance while accounting for the number of interaction rounds required to reach conclusions. Moreover, SWE-Arena incorporates a new feature called \emph{RepoChat}, which automatically injects repository-related context (e.g., issues, commits, pull requests) into the conversation, further aligning evaluations with real-world development processes. This paper outlines the design and capabilities of SWE-Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.
comment: Check the online arena at https://huggingface.co/spaces/SWE-Arena/Software-Engineering-Arena
♻ ☆ Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior.
♻ ☆ Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning
Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
comment: 7 pages, 6 figures, under review
♻ ☆ Unraveling the Black Box of Neural Networks: A Dynamic Extremum Mapper
We point out that neural networks are not black boxes, and their generalization stems from the ability to dynamically map a dataset to the extrema of the model function. We further prove that the number of extrema in a neural network is positively correlated with the number of its parameters. We then propose a new algorithm that is significantly different from back-propagation algorithm, which mainly obtains the values of parameters by solving a system of linear equations. Some difficult situations, such as gradient vanishing and overfitting, can be simply explained and dealt with in this framework.
comment: for understanding the common principles of neural networks, e.g., large language models
♻ ☆ Generalized Probabilistic Approximate Optimization Algorithm
We introduce a generalized \textit{Probabilistic Approximate Optimization Algorithm (PAOA)}, a classical variational Monte Carlo framework that extends and formalizes prior work by Weitz \textit{et al.}~\cite{Combes_2023}, enabling parameterized and fast sampling on present-day Ising machines and probabilistic computers. PAOA operates by iteratively modifying the couplings of a network of binary stochastic units, guided by cost evaluations from independent samples. We establish a direct correspondence between derivative-free updates and the gradient of the full Markov flow over the exponentially large state space, showing that PAOA admits a principled variational formulation. Simulated annealing emerges as a limiting case under constrained parameterizations, and we implement this regime on an FPGA-based probabilistic computer with on-chip annealing to solve large 3D spin-glass problems. Benchmarking PAOA against QAOA on the canonical 26-spin Sherrington-Kirkpatrick model with matched parameters reveals superior performance for PAOA. We show that PAOA naturally extends simulated annealing by optimizing multiple temperature profiles, leading to improved performance over SA on heavy-tailed problems such as SK-L\'evy.
♻ ☆ GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration
Balancing multiple objectives is critical for user satisfaction in modern recommender and search systems, yet current Multi-Task Fusion (MTF) methods rely on static, manually-tuned weights that fail to capture individual user intent. While Reinforcement Learning (RL) offers a path to personalization, traditional approaches often falter due to training instability and the sparse rewards inherent in these large-scale systems. To address these limitations, we propose Group-relative Reinforcement learning with Adaptive Dirichlet Exploration (GRADE), a novel and robust framework for personalized multi-task fusion. GRADE leverages a critic-free, Group Relative Policy Optimization (GRPO) paradigm, enabling stable and efficient policy learning by evaluating the relative performance of candidate weight groups. Its core innovations include employing the Dirichlet distribution for principled and structured exploration of the weight space, and a composite reward function that combines sparse user feedback with dense model priors and rule-based constraints to guide the search effectively. Deployed in the in-app marketplace of an application with over hundreds of millions daily active users, GRADE significantly outperforms established baselines, achieving substantial gains in rigorous large-scale A/B tests: +0.595\% in CTR, +1.193\% in CVR, +1.788\% in OPM, and +1.568\% in total order volume. Following its strong performance, GRADE has been fully deployed in the marketplace search scenario of Kuaishou, serving hundreds of millions of users.
♻ ☆ Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning NeurIPS 2025
Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks, but fine-tuning them for domain-specific applications often requires substantial domain-specific data that may be distributed across multiple organizations. Federated Learning (FL) offers a privacy-preserving solution, but faces challenges with computational constraints when applied to LLMs. Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient fine-tuning approach, though a single LoRA module often struggles with heterogeneous data across diverse domains. This paper addresses two critical challenges in federated LoRA fine-tuning: 1. determining the optimal number and allocation of LoRA experts across heterogeneous clients, and 2. enabling clients to selectively utilize these experts based on their specific data characteristics. We propose FedLEASE (Federated adaptive LoRA Expert Allocation and SElection), a novel framework that adaptively clusters clients based on representation similarity to allocate and train domain-specific LoRA experts. It also introduces an adaptive top-$M$ Mixture-of-Experts mechanism that allows each client to select the optimal number of utilized experts. Our extensive experiments on diverse benchmark datasets demonstrate that FedLEASE significantly outperforms existing federated fine-tuning approaches in heterogeneous client settings while maintaining communication efficiency.
comment: Accepted to NeurIPS 2025
♻ ☆ RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation ICCV 2025
Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and clothing semantic absence in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusing. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
comment: Accept by ICCV 2025 (Highlight). Project website: https://colorful-liyu.github.io/RAGDiffusion-page/
♻ ☆ VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones
Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisonTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in GIFT-Eval benchmark which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
comment: 19 pages
♻ ☆ Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models
To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an "imagination-based" environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search-based planning algorithm that leverages non-differentiable domain metrics as reward signals to explore high-return future sequences. These identified high-reward candidates then serve as pseudo-labels to continuously optimize the agent's policy through iterative self-training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.
♻ ☆ On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations.
♻ ☆ SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening
Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a $(1+O(\epsilon))$ factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.
comment: 12 pages, 5 figures, Code Available: https://doi.org/10.5281/zenodo.17223405
♻ ☆ ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection ICLR24
Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimate scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively.
comment: ICLR24 poster
♻ ☆ Large Language Model Agent for Modular Task Execution in Drug Discovery
We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.
♻ ☆ Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 $\pm$ 0.10, balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
comment: S.G.-F., M.B., and A.E. contributed equally to this work and share first authorship. M.Z. and P.F. contributed equally to this work and share senior authorship
♻ ☆ Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
comment: 16 pages, 6 figures
♻ ☆ What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context
Sequential recommendation systems aspire to profile users by interpreting their interaction histories, echoing how humans make decisions by weighing experience, relative preference strength, and situational relevance. Yet, existing large language model (LLM)-based recommenders often fall short of mimicking the flexible, context-aware decision strategies humans exhibit, neglecting the structured, dynamic, and context-aware mechanisms fundamental to human behaviors. To bridge this gap, we propose RecPO, a preference optimization framework that models structured feedback and contextual delay to emulate human-like prioritization in sequential recommendation. RecPO exploits adaptive reward margins based on inferred preference hierarchies and temporal signals, enabling the model to favor immediately relevant items and to distinguish between varying degrees of preference and aversion. Extensive experiments across five real-world datasets demonstrate that RecPO not only yields performance gains over state-of-the-art baselines, but also mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
♻ ☆ EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework
Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, limiting its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO across three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse reasoning benchmarks demonstrate that EFRame achieves consistent gains, including a 37.9\% relative improvement on Geometry3K over GRPO. EFRame further supports fine-grained sample categorization and precise entropy control, highlighting it as a robust solution for advancing deeper reasoning in LLMs. Our code is available at https://github.com/597358816/EFRame.
♻ ☆ E-ICL: Enhancing Fine-Grained Emotion Recognition through the Lens of Prototype Theory
In-context learning (ICL) achieves remarkable performance in various domains such as knowledge acquisition, commonsense reasoning, and semantic understanding. However, its performance significantly deteriorates for emotion detection tasks, especially fine-grained emotion recognition. The underlying reasons for this remain unclear. In this paper, we identify the reasons behind ICL's poor performance from the perspective of prototype theory and propose a method to address this issue. Specifically, we conduct extensive pilot experiments and find that ICL conforms to the prototype theory on fine-grained emotion recognition. Based on this theory, we uncover the following deficiencies in ICL: (1) It relies on prototypes (example-label pairs) that are semantically similar but emotionally inaccurate to predict emotions. (2) It is prone to interference from irrelevant categories, affecting the accuracy and robustness of the predictions. To address these issues, we propose an Emotion Context Learning method (E-ICL) on fine-grained emotion recognition. E-ICL relies on more emotionally accurate prototypes to predict categories by referring to emotionally similar examples with dynamic labels. Simultaneously, E-ICL employs an exclusionary emotion prediction strategy to avoid interference from irrelevant categories, thereby increasing its accuracy and robustness. Note that the entire process is accomplished with the assistance of a plug-and-play emotion auxiliary model, without additional training. Experiments on the fine-grained emotion datasets EDOS, Empathetic-Dialogues, EmpatheticIntent, and GoEmotions show that E-ICL achieves superior emotion prediction performance. Furthermore, even when the emotion auxiliary model used is lower than 10% of the LLMs, E-ICL can still boost the performance of LLMs by over 4% on multiple datasets.
comment: This paper has been superseded by an updated version titled Fine-Grained Emotion Recognition via In-Context Learning (arXiv:2510.06600)
♻ ☆ A Constructive Framework for Nondeterministic Automata via Time-Shared, Depth-Unrolled Feedforward Networks
We present a formal and constructive simulation framework for nondeterministic finite automata (NFAs) using time-shared, depth-unrolled feedforward networks (TS-FFNs), i.e., acyclic unrolled computations with shared parameters that are functionally equivalent to unrolled recurrent or state-space models. Unlike prior approaches that rely on explicit recurrent architectures or post hoc extraction methods, our formulation symbolically encodes automaton states as binary vectors, transitions as sparse matrix transformations, and nondeterministic branching-including $\varepsilon$-closures-as compositions of shared thresholded updates. We prove that every regular language can be recognized exactly by such a shared-parameter unrolled feedforward network, with parameter count independent of input length. Our construction yields a constructive equivalence between NFAs and neural networks and demonstrates \emph{empirical learnability}: these networks can be trained via gradient descent on supervised acceptance data to recover the target automaton behavior. This learnability, formalized in Proposition 5.1, is the crux of this work. Extensive experiments validate the theoretical results, achieving perfect or near-perfect agreement on acceptance, state propagation, and closure dynamics. This work clarifies the correspondence between automata theory and modern neural architectures, showing that unrolled feedforward networks can perform precise, interpretable, and trainable symbolic computation.
comment: 24 pages, with proofs, 8 figures and example in Appendix. IEEE Access Submission
♻ ☆ Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while maintaining communication efficiency. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
comment: 8 pages main body, 33 pages of supplementary material for hyperparameter configurations, full proofs of theorems and additional results
Computer Vision and Pattern Recognition 218
☆ ReSplat: Learning Recurrent Gaussian Splats
While feed-forward Gaussian splatting models provide computational efficiency and effectively handle sparse input settings, their performance is fundamentally limited by the reliance on a single forward pass during inference. We propose ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicitly computing gradients. Our key insight is that the Gaussian splatting rendering error serves as a rich feedback signal, guiding the recurrent network to learn effective Gaussian updates. This feedback signal naturally adapts to unseen data distributions at test time, enabling robust generalization. To initialize the recurrent process, we introduce a compact reconstruction model that operates in a $16 \times$ subsampled space, producing $16 \times$ fewer Gaussians than previous per-pixel Gaussian models. This substantially reduces computational overhead and allows for efficient Gaussian updates. Extensive experiments across varying of input views (2, 8, 16), resolutions ($256 \times 256$ to $540 \times 960$), and datasets (DL3DV and RealEstate10K) demonstrate that our method achieves state-of-the-art performance while significantly reducing the number of Gaussians and improving the rendering speed. Our project page is at https://haofeixu.github.io/resplat/.
comment: Project page: https://haofeixu.github.io/resplat/
☆ Scalable Offline Metrics for Autonomous Driving
Real-World evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e., by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is understudied, particularly across diverse closed-loop metrics and complex urban maneuvers. In this work, we revisit this undervalued question in policy evaluation through an extensive set of experiments across diverse conditions and metrics. Based on analysis in simulation, we find an even worse correlation between offline and online settings than reported by prior studies, casting doubts on the validity of current evaluation practices and metrics for driving policies. Next, we bridge the gap between offline and online evaluation. We investigate an offline metric based on epistemic uncertainty, which aims to capture events that are likely to cause errors in closed-loop settings. The resulting metric achieves over 13% improvement in correlation compared to previous offline metrics. We further validate the generalization of our findings beyond the simulation environment in real-world settings, where even greater gains are observed.
comment: Accepted at IROS 2025 (IEEE/RSJ International Conference on Intelligent Robots and Systems)
☆ NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos
Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training. Project website: https://novaflow.lhy.xyz/.
☆ MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at https://github.com/mbzuai-oryx/MATRIX.
☆ D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction
Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework D$^2$GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of the sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse view conditions. The project page can be found at: https://insta360-research-team.github.io/DDGS-website/.
☆ How to Teach Large Multimodal Models New Skills
How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
comment: In submission. Code is available at https://github.com/jessemelpolio/LMM_CL
☆ NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints NeurIPS 2025
Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.
comment: Accepted by NeurIPS 2025. 22 pages, link: https://github.com/OpenGVLab/NaViL
☆ ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of causal inference, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task to predict the residual deviation from a deterministic inertial reference. The inertial reference serves as a counterfactual, forcing the model to move beyond simple pattern recognition and instead identify the underlying causal factors (e.g., traffic rules, obstacles) that necessitate deviations from a default, inertially-guided path. To deal with the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. It re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. Extensive experiments validate the effectiveness of our framework. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy with only two denoising steps, demonstrating that our approach significantly simplifies the learning task and improves model performance. The code will be released to facilitate further research.
☆ MultiCOIN: Multi-Modal COntrollable Video INbetweening
Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce \modelname{}, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
comment: Project website: https://multicoinx.github.io/multicoin/
☆ SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios where perception/recognition is heavily relied on, while with relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench could fit the interests of the community and help to push the boundary of cutting-edge AI for border science.
☆ DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model
Achieving generalized in-hand object rotation remains a significant challenge in robotics, largely due to the difficulty of transferring policies from simulation to the real world. The complex, contact-rich dynamics of dexterous manipulation create a "reality gap" that has limited prior work to constrained scenarios involving simple geometries, limited object sizes and aspect ratios, constrained wrist poses, or customized hands. We address this sim-to-real challenge with a novel framework that enables a single policy, trained in simulation, to generalize to a wide variety of objects and conditions in the real world. The core of our method is a joint-wise dynamics model that learns to bridge the reality gap by effectively fitting limited amount of real-world collected data and then adapting the sim policy's actions accordingly. The model is highly data-efficient and generalizable across different whole-hand interaction distributions by factorizing dynamics across joints, compressing system-wide influences into low-dimensional variables, and learning each joint's evolution from its own dynamic profile, implicitly capturing these net effects. We pair this with a fully autonomous data collection strategy that gathers diverse, real-world interaction data with minimal human intervention. Our complete pipeline demonstrates unprecedented generality: a single policy successfully rotates challenging objects with complex shapes (e.g., animals), high aspect ratios (up to 5.33), and small sizes, all while handling diverse wrist orientations and rotation axes. Comprehensive real-world evaluations and a teleoperation application for complex tasks validate the effectiveness and robustness of our approach. Website: https://meowuu7.github.io/DexNDM/
comment: Project Website: https://meowuu7.github.io/DexNDM/ Video: https://youtu.be/tU2Mv8vWftU
☆ VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
comment: Project page: https://onevfall.github.io/project_page/videocanvas
☆ Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.
comment: 14 pages, 6 figures, 13 tables
☆ ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation
On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: https://city-super.github.io/artdeco/.
☆ R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation
Towards the aim of generalized robotic manipulation, spatial generalization is the most fundamental capability that requires the policy to work robustly under different spatial distribution of objects, environment and agent itself. To achieve this, substantial human demonstrations need to be collected to cover different spatial configurations for training a generalized visuomotor policy via imitation learning. Prior works explore a promising direction that leverages data generation to acquire abundant spatially diverse data from minimal source demonstrations. However, most approaches face significant sim-to-real gap and are often limited to constrained settings, such as fixed-base scenarios and predefined camera viewpoints. In this paper, we propose a real-to-real 3D data generation framework (R2RGen) that directly augments the pointcloud observation-action pairs to generate real-world data. R2RGen is simulator- and rendering-free, thus being efficient and plug-and-play. Specifically, given a single source demonstration, we introduce an annotation mechanism for fine-grained parsing of scene and trajectory. A group-wise augmentation strategy is proposed to handle complex multi-object compositions and diverse task constraints. We further present camera-aware processing to align the distribution of generated data with real-world 3D sensor. Empirically, R2RGen substantially enhances data efficiency on extensive experiments and demonstrates strong potential for scaling and application on mobile manipulation.
comment: Project page: https://r2rgen.github.io/
☆ VideoNorms: Benchmarking Cultural Awareness of Video Language Models
As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violations labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset which highlight several common trends: 1) models performs worse on norm violation than adherence; 2) models perform worse w.r.t Chinese culture compared to the US culture; 3) models have more difficulty in providing non-verbal evidence compared to verbal for the norm adhere/violation label and struggle to identify the exact norm corresponding to a speech-act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.
comment: 24 pages, 5 figures, under review
☆ MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
☆ Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.
comment: Project Page: https://snap-research.github.io/kontinuouskontext/
☆ SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
comment: Project Page: https://zju-real.github.io/SpatialLadder/ Code: https://github.com/ZJU-REAL/SpatialLadder
☆ X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering
We present X2Video, the first diffusion model for rendering photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance, while supporting intuitive multi-modal controls with reference images and text prompts for both global and local regions. The intrinsic guidance allows accurate manipulation of color, material, geometry, and lighting, while reference images and text prompts provide intuitive adjustments in the absence of intrinsic information. To enable these functionalities, we extend the intrinsic-guided image generation model XRGB to video generation by employing a novel and efficient Hybrid Self-Attention, which ensures temporal consistency across video frames and also enhances fidelity to reference images. We further develop a Masked Cross-Attention to disentangle global and local text prompts, applying them effectively onto respective local and global regions. For generating long videos, our novel Recursive Sampling method incorporates progressive frame sampling, combining keyframe prediction and frame interpolation to maintain long-range temporal consistency while preventing error accumulation. To support the training of X2Video, we assembled a video dataset named InteriorVideo, featuring 1,154 rooms from 295 interior scenes, complete with reliable ground-truth intrinsic channel sequences and smooth camera trajectories. Both qualitative and quantitative evaluations demonstrate that X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions. Additionally, X2Video effectively accommodates multi-modal controls with reference images, global and local text prompts, and simultaneously supports editing on color, material, geometry, and lighting through parametric tuning. Project page: https://luckyhzt.github.io/x2video
comment: Code, model, and dataset will be released at project page soon: https://luckyhzt.github.io/x2video
☆ FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.
comment: Project Page: https://bestzzhang.github.io/FlexTraj
☆ SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks
This paper presents a theoretical framework explaining why fine tuning small, randomly selected subnetworks (slices) within pre trained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy their backbone representations retain rich, task relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter efficient fine tuning (PEFT) in large scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state of the art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.
☆ Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression
Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially nowadays with the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.
comment: Accepted for publication in IEEE Robotics and Automation Letters (RA-L). 8 pages, 6 figures
☆ To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
comment: Preprint. Project page: https://davidhalladay.github.io/diysink_demo
☆ MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first \underline{M}ixture-\underline{o}f-\underline{A}gents \underline{V}ideo \underline{R}estoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the \underline{Res}tored \underline{V}ideo \underline{Q}uality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.
☆ AI-Driven Radiology Report Generation for Traumatic Brain Injuries
Traumatic brain injuries present significant diagnostic challenges in emergency medicine, where the timely interpretation of medical images is crucial for patient outcomes. In this paper, we propose a novel AI-based approach for automatic radiology report generation tailored to cranial trauma cases. Our model integrates an AC-BiFPN with a Transformer architecture to capture and process complex medical imaging data such as CT and MRI scans. The AC-BiFPN extracts multi-scale features, enabling the detection of intricate anomalies like intracranial hemorrhages, while the Transformer generates coherent, contextually relevant diagnostic reports by modeling long-range dependencies. We evaluate the performance of our model on the RSNA Intracranial Hemorrhage Detection dataset, where it outperforms traditional CNN-based models in both diagnostic accuracy and report generation. This solution not only supports radiologists in high-pressure environments but also provides a powerful educational tool for trainee physicians, offering real-time feedback and enhancing their learning experience. Our findings demonstrate the potential of combining advanced feature extraction with transformer-based text generation to improve clinical decision-making in the diagnosis of traumatic brain injuries.
☆ Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/
comment: 63 pages, 29 tables, and 47 figures
☆ Splat the Net: Radiance Fields with Splattable Neural Primitives
Radiance fields have emerged as a predominant representation for modeling 3D scene appearance. Neural formulations such as Neural Radiance Fields provide high expressivity but require costly ray marching for rendering, whereas primitive-based methods such as 3D Gaussian Splatting offer real-time efficiency through splatting, yet at the expense of representational power. Inspired by advances in both these directions, we introduce splattable neural primitives, a new volumetric representation that reconciles the expressivity of neural models with the efficiency of primitive-based splatting. Each primitive encodes a bounded neural density field parameterized by a shallow neural network. Our formulation admits an exact analytical solution for line integrals, enabling efficient computation of perspectively accurate splatting kernels. As a result, our representation supports integration along view rays without the need for costly ray marching. The primitives flexibly adapt to scene geometry and, being larger than prior analytic primitives, reduce the number required per scene. On novel-view synthesis benchmarks, our approach matches the quality and speed of 3D Gaussian Splatting while using $10\times$ fewer primitives and $6\times$ fewer parameters. These advantages arise directly from the representation itself, without reliance on complex control or adaptation frameworks. The project page is https://vcai.mpi-inf.mpg.de/projects/SplatNet/.
☆ InstructX: Towards Unified Visual Editing with MLLM Guidance
With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.
☆ The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the \textit{Visual Iconicity Challenge}, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess $13$ state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On \textit{phonological form prediction}, VLMs recover some handshape and location detail but remain below human performance; on \textit{transparency}, they are far from human baselines; and only top models correlate moderately with human \textit{iconicity ratings}. Interestingly, \textit{models with stronger phonological form prediction correlate better with human iconicity judgment}, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
☆ Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.
☆ DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos
We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos, without the need for manual data collection and costly motion capture, and enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation.
comment: Video results are available at: https://embodiedai-ntu.github.io/dexman/index.html
☆ Hierarchical Spatial Algorithms for High-Resolution Image Quantization and Feature Extraction
This study introduces a modular framework for spatial image processing, integrating grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction. A stepwise intensity transformation quantizes grayscale images into eight discrete levels, producing a posterization effect that simplifies representation while preserving structural detail. Color enhancement is achieved via histogram equalization in both RGB and YCrCb color spaces, with the latter improving contrast while maintaining chrominance fidelity. Brightness adjustment is implemented through HSV value-channel manipulation, and image sharpening is performed using a 3 * 3 convolution kernel to enhance high-frequency details. A bidirectional transformation pipeline that integrates unsharp masking, gamma correction, and noise amplification achieved accuracy levels of 76.10% and 74.80% for the forward and reverse processes, respectively. Geometric feature extraction employed Canny edge detection, Hough-based line estimation (e.g., 51.50{\deg} for billiard cue alignment), Harris corner detection, and morphological window localization. Cue isolation further yielded 81.87\% similarity against ground truth images. Experimental evaluation across diverse datasets demonstrates robust and deterministic performance, highlighting its potential for real-time image analysis and computer vision.
comment: There are 14 pages journal paper
☆ Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
comment: Project page: https://andrewcwlee.github.io/gaze-on-the-prize
☆ Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
☆ Reinforcing Diffusion Models by Direct Group Preference Optimization
While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
☆ Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin
The mechanosensory system of teeth is currently believed to partly rely on Odontoblast cells stimulation by fluid flow through a porosity network extending through dentin. Visualizing the smallest sub-microscopic porosity vessels therefore requires the highest achievable resolution from confocal fluorescence microscopy, the current gold standard. This considerably limits the extent of the field of view to very small sample regions. To overcome this limitation, we tested different deep learning (DL) super-resolution (SR) models to allow faster experimental acquisitions of lower resolution images and restore optimal image quality by post-processing. Three supervised 2D SR models (RCAN, pix2pix, FSRCNN) and one unsupervised (CycleGAN) were applied to a unique set of experimentally paired high- and low-resolution confocal images acquired with different sampling schemes, resulting in a pixel size increase of x2, x4, x8. Model performance was quantified using a broad set of similarity and distribution-based image quality assessment (IQA) metrics, which yielded inconsistent results that mostly contradicted our visual perception. This raises the question of the relevance of such generic metrics to efficiently target the specific structure of dental porosity. To resolve this conflicting information, the generated SR images were segmented taking into account the specific scales and morphology of the porosity network and analysed by comparing connected components. Additionally, the capacity of the SR models to preserve 3D porosity connectivity throughout the confocal image stacks was evaluated using graph analysis. This biology-driven assessment allowed a far better mechanistic interpretation of SR performance, highlighting differences in model sensitivity to weak intensity features and the impact of non-linearity in image generation, which explains the failure of standard IQA metrics.
VideoVerse: How Far is Your T2V Generator from a World Model?
The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to build ``world models'', makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which are essential capabilities for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human preference aligned QA-based evaluation pipeline is developed by using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.
comment: 24 Pages, 8 Figures, 11 Tables
Spectral Prefiltering of Neural Fields
Neural fields excel at representing continuous visual signals but typically operate at a single, fixed resolution. We present a simple yet powerful method to optimize neural fields that can be prefiltered in a single forward pass. Key innovations and features include: (1) We perform convolutional filtering in the input domain by analytically scaling Fourier feature embeddings with the filter's frequency response. (2) This closed-form modulation generalizes beyond Gaussian filtering and supports other parametric filters (Box and Lanczos) that are unseen at training time. (3) We train the neural field using single-sample Monte Carlo estimates of the filtered signal. Our method is fast during both training and inference, and imposes no additional constraints on the network architecture. We show quantitative and qualitative improvements over existing methods for neural-field filtering.
comment: 16 pages, 10 figures, to be published in Siggraph Asia 2025, Website: https://myaldiz.info/assets/spnf
☆ Robust Source-Free Domain Adaptation for Medical Image Segmentation based on Curriculum Learning
Recent studies have uncovered a new research line, namely source-free domain adaptation, which adapts a model to target domains without using the source data. Such a setting can address the concerns on data privacy and security issues of medical images. However, current source-free domain adaptation frameworks mainly focus on the pseudo label refinement for target data without the consideration of learning procedure. Indeed, a progressive learning process from source to target domain will benefit the knowledge transfer during model adaptation. To this end, we propose a curriculum-based framework, namely learning from curriculum (LFC), for source-free domain adaptation, which consists of easy-to-hard and source-to-target curricula. Concretely, the former curriculum enables the framework to start learning with `easy' samples and gradually tune the optimization direction of model adaption by increasing the sample difficulty. While, the latter can stablize the adaptation process, which ensures smooth transfer of the model from the source domain to the target. We evaluate the proposed source-free domain adaptation approach on the public cross-domain datasets for fundus segmentation and polyp segmentation. The extensive experimental results show that our framework surpasses the existing approaches and achieves a new state-of-the-art.
☆ Detecting Legend Items on Historical Maps Using GPT-4o with In-Context Learning
Historical map legends are critical for interpreting cartographic symbols. However, their inconsistent layouts and unstructured formats make automatic extraction challenging. Prior work focuses primarily on segmentation or general optical character recognition (OCR), with few methods effectively matching legend symbols to their corresponding descriptions in a structured manner. We present a method that combines LayoutLMv3 for layout detection with GPT-4o using in-context learning to detect and link legend items and their descriptions via bounding box predictions. Our experiments show that GPT-4 with structured JSON prompts outperforms the baseline, achieving 88% F-1 and 85% IoU, and reveal how prompt design, example counts, and layout alignment affect performance. This approach supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across various visual styles.
☆ UniVideo: Unified Understanding, Generation, and Editing for Videos
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
comment: Project Website https://congwei1230.github.io/UniVideo/
☆ Hyperspectral data augmentation with transformer-based diffusion models
The introduction of new generation hyperspectral satellite sensors, combined with advancements in deep learning methodologies, has significantly enhanced the ability to discriminate detailed land-cover classes at medium-large scales. However, a significant challenge in deep learning methods is the risk of overfitting when training networks with small labeled datasets. In this work, we propose a data augmentation technique that leverages a guided diffusion model. To effectively train the model with a limited number of labeled samples and to capture complex patterns in the data, we implement a lightweight transformer network. Additionally, we introduce a modified weighted loss function and an optimized cosine variance scheduler, which facilitate fast and effective training on small datasets. We evaluate the effectiveness of the proposed method on a forest classification task with 10 different forest types using hyperspectral images acquired by the PRISMA satellite. The results demonstrate that the proposed method outperforms other data augmentation techniques in both average and weighted average accuracy. The effectiveness of the method is further highlighted by the stable training behavior of the model, which addresses a common limitation in the practical application of deep generative models for data augmentation.
comment: 10 pages, 2 figures, accepted at SPIE REMOTE SENSING conference 16-20 September 2024 Edinburgh, United Kingdom
☆ SPICE: Simple and Practical Image Clarification and Enhancement
We introduce a simple and efficient method to enhance and clarify images. More specifically, we deal with low light image enhancement and clarification of hazy imagery (hazy/foggy images, images containing sand dust, and underwater images). Our method involves constructing an image filter to simulate low-light or hazy conditions and deriving approximate reverse filters to minimize distortions in the enhanced images. Experimental results show that our approach is highly competitive and often surpasses state-of-the-art techniques in handling extremely dark images and in enhancing hazy images. A key advantage of our approach lies in its simplicity: Our method is implementable with just a few lines of MATLAB code.
comment: 5 pages, 8 figures
☆ Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
☆ LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
comment: Code will be released upon acceptance
☆ Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
comment: Work in process
☆ LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation
Referring Video Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and the computation complexity significantly increases with the increase of video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. Firstly, we stack sparse local attentions to balance the locality and globality. We design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance the globality. Secondly, we design a global query to interact with all the other queries to directly encode the global context information. Experiments show our method achieves new state-of-the-art on four referring video segmentation benchmarks. Notably, our method shows an improvement of 11.3% and 8.3% on the MeViS valu and val datasets respectively.
comment: Accepted by IEEE TCSVT
☆ Learning Neural Exposure Fields for View Synthesis NeurIPS 2025
Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. In the core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need of post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks improving by over 55% over best-performing baselines.
comment: Accepted to NeurIPS 2025. Project page available at https://www.m-niemeyer.github.io/nexf/index.html
☆ A Multimodal Depth-Aware Method For Embodied Reference Understanding
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
☆ One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting NeurIPS 2025
Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
comment: 25 pages, 11 figures, to appear NeurIPS 2025
☆ SViM3D: Stable Video Material Diffusion for Single Image 3D Generation ICCV 2025
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
comment: Accepted by International Conference on Computer Vision (ICCV 2025). Project page: http://svim3d.aengelhardt.com
☆ Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification
Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism combined with Mixup and a dual exponential moving average (EMA) module for robust pseudo-label generation. To maximize AdaGC's effectiveness, we introduce a simple yet theoretically grounded indicator to adaptively trigger GC after an initial warm-up stage based on training dynamics, thereby guaranteeing the effectiveness of GC in mitigating overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings.
comment: 14 pages, 6 figures
☆ Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction
Human interaction is inherently dynamic and hierarchical, where the dynamic refers to the motion changes with distance, and the hierarchy is from individual to inter-individual and ultimately to overall motion. Exploiting these properties is vital for dual-human motion generation, while existing methods almost model human interaction temporally invariantly, ignoring distance and hierarchy. To address it, we propose a fine-grained dual-human motion generation method, namely FineDual, a tri-stage method to model the dynamic hierarchical interaction from individual to inter-individual. The first stage, Self-Learning Stage, divides the dual-human overall text into individual texts through a Large Language Model, aligning text features and motion features at the individual level. The second stage, Adaptive Adjustment Stage, predicts interaction distance by an interaction distance predictor, modeling human interactions dynamically at the inter-individual level by an interaction-aware graph network. The last stage, Teacher-Guided Refinement Stage, utilizes overall text features as guidance to refine motion features at the overall level, generating fine-grained and high-quality dual-human motion. Extensive quantitative and qualitative evaluations on dual-human motion datasets demonstrate that our proposed FineDual outperforms existing approaches, effectively modeling dynamic hierarchical human interaction.
☆ InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing
Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.
☆ Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data
Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise, hindering model performance. While methods exist for each issue, effectively combining them is non-trivial, as distinguishing genuine tail samples from noisy data proves difficult, often leading to conflicting optimization strategies. This paper presents a novel perspective: instead of primarily developing new complex techniques from scratch, we explore synergistically leveraging well-established, individually 'weak' auxiliary models - specialized for tackling either class imbalance or label noise but not both. This view is motivated by the insight that class imbalance (a distributional-level concern) and label noise (a sample-level concern) operate at different granularities, suggesting that robustness mechanisms for each can in principle offer complementary strengths without conflict. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights from such 'weak', single-purpose auxiliary models. Specifically, D-SINK uses an optimal transport-optimized surrogate label allocation to align the target model's sample-level predictions with a noise-robust auxiliary and its class distributions with an imbalance-robust one. Extensive experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
comment: 25 pages, 2 figures
☆ Robust Canonicalization through Bootstrapped Data Re-Alignment
Fine-grained visual classification (FGVC) tasks, such as insect and bird identification, demand sensitivity to subtle visual cues while remaining robust to spatial transformations. A key challenge is handling geometric biases and noise, such as different orientations and scales of objects. Existing remedies rely on heavy data augmentation, which demands powerful models, or on equivariant architectures, which constrain expressivity and add cost. Canonicalization offers an alternative by shielding such biases from the downstream model. In practice, such functions are often obtained using canonicalization priors, which assume aligned training data. Unfortunately, real-world datasets never fulfill this assumption, causing the obtained canonicalizer to be brittle. We propose a bootstrapping algorithm that iteratively re-aligns training samples by progressively reducing variance and recovering the alignment assumption. We establish convergence guarantees under mild conditions for arbitrary compact groups, and show on four FGVC benchmarks that our method consistently outperforms equivariant, and canonicalization baselines while performing on par with augmentation.
☆ NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.
☆ Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing
Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defined intended edited regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.
comment: 25pages,20figures
☆ UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
☆ Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon recently draws the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervention the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.
☆ Real-Time Motion-Controllable Autoregressive Video Diffusion
Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.
☆ Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation
Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians' workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.
comment: 10 pages, 9 figures. This work has been submitted to the IEEE for possible publication
Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting
Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multiview predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multiview consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real-world settings.
comment: Accepted to VCIP 2025 (International Conference on Visual Communications and Image Processing 2025)
☆ DarkHash: A Data-Free Backdoor Attack Against Deep Hashing
Benefiting from its superior feature learning capabilities and efficiency, deep hashing has achieved remarkable success in large-scale image retrieval. Recent studies have demonstrated the vulnerability of deep hashing models to backdoor attacks. Although these studies have shown promising attack results, they rely on access to the training dataset to implant the backdoor. In the real world, obtaining such data (e.g., identity information) is often prohibited due to privacy protection and intellectual property concerns. Embedding backdoors into deep hashing models without access to the training data, while maintaining retrieval accuracy for the original task, presents a novel and challenging problem. In this paper, we propose DarkHash, the first data-free backdoor attack against deep hashing. Specifically, we design a novel shadow backdoor attack framework with dual-semantic guidance. It embeds backdoor functionality and maintains original retrieval accuracy by fine-tuning only specific layers of the victim model using a surrogate dataset. We consider leveraging the relationship between individual samples and their neighbors to enhance backdoor attacks during training. By designing a topological alignment loss, we optimize both individual and neighboring poisoned samples toward the target sample, further enhancing the attack capability. Experimental results on four image datasets, five model architectures, and two hashing methods demonstrate the high effectiveness of DarkHash, outperforming existing state-of-the-art backdoor attack methods. Defense experiments show that DarkHash can withstand existing mainstream backdoor defense methods.
comment: Accepted by TIFS 2025
☆ Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection NeurIPS 2025
AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
comment: Accepted at NeurIPS 2025 spotlight
☆ Towards Real-World Deepfake Detection: A Diverse In-the-wild Dataset of Forgery Faces
Deepfakes, leveraging advanced AIGC (Artificial Intelligence-Generated Content) techniques, create hyper-realistic synthetic images and videos of human faces, posing a significant threat to the authenticity of social media. While this real-world threat is increasingly prevalent, existing academic evaluations and benchmarks for detecting deepfake forgery often fall short to achieve effective application for their lack of specificity, limited deepfake diversity, restricted manipulation techniques.To address these limitations, we introduce RedFace (Real-world-oriented Deepfake Face), a specialized facial deepfake dataset, comprising over 60,000 forged images and 1,000 manipulated videos derived from authentic facial features, to bridge the gap between academic evaluations and real-world necessity. Unlike prior benchmarks, which typically rely on academic methods to generate deepfakes, RedFace utilizes 9 commercial online platforms to integrate the latest deepfake technologies found "in the wild", effectively simulating real-world black-box scenarios.Moreover, RedFace's deepfakes are synthesized using bespoke algorithms, allowing it to capture diverse and evolving methods used by real-world deepfake creators. Extensive experimental results on RedFace (including cross-domain, intra-domain, and real-world social network dissemination simulations) verify the limited practicality of existing deepfake detection schemes against real-world applications. We further perform a detailed analysis of the RedFace dataset, elucidating the reason of its impact on detection performance compared to conventional datasets. Our dataset is available at: https://github.com/kikyou-220/RedFace.
☆ A class-driven hierarchical ResNet for classification of multispectral remote sensing images
This work presents a multitemporal class-driven hierarchical Residual Neural Network (ResNet) designed for modelling the classification of Time Series (TS) of multispectral images at different semantical class levels. The architecture consists of a modification of the ResNet where we introduce additional branches to perform the classification at the different hierarchy levels and leverage on hierarchy-penalty maps to discourage incoherent hierarchical transitions within the classification. In this way, we improve the discrimination capabilities of classes at different levels of semantic details and train a modular architecture that can be used as a backbone network for introducing new specific classes and additional tasks considering limited training samples available. We exploit the class-hierarchy labels to train efficiently the different layers of the architecture, allowing the first layers to train faster on the first levels of the hierarchy modeling general classes (i.e., the macro-classes) and the intermediate classes, while using the last ones to discriminate more specific classes (i.e., the micro-classes). In this way, the targets are constrained in following the hierarchy defined, improving the classification of classes at the most detailed level. The proposed modular network has intrinsic adaptation capability that can be obtained through fine tuning. The experimental results, obtained on two tiles of the Amazonian Forest on 12 monthly composites of Sentinel 2 images acquired during 2019, demonstrate the effectiveness of the hierarchical approach in both generalizing over different hierarchical levels and learning discriminant features for an accurate classification at the micro-class level on a new target area, with a better representation of the minoritarian classes.
comment: 11 pages, 2 figures, accepted conference paper at SPIE REMOTE SENSING, 3-7 September 2023, Amsterdam, Netherlands
☆ RetouchLLM: Training-free White-box Image Retouching
Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises of two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable codes. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.
☆ RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans BMVC-2025
Weakly Supervised Anomaly detection (WSAD) in brain MRI scans is an important challenge useful to obtain quick and accurate detection of brain anomalies when precise pixel-level anomaly annotations are unavailable and only weak labels (e.g., slice-level) are available. In this work, we propose RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings, a novel two-stage WSAD framework. In the first stage, we introduce a Discriminative Dual Prompt Tuning (DDPT) mechanism that generates high-quality pseudo weak masks based on slice-level labels, serving as coarse localization cues. In the second stage, we propose a segmentation network with a region-aware spatial attention mechanism that relies on fixed location-based random embeddings. This design enables the model to effectively focus on anomalous regions. Our approach achieves state-of-the-art anomaly detection performance, significantly outperforming existing WSAD methods while utilizing less than 8 million parameters. Extensive evaluations on the BraTS20, BraTS21, BraTS23, and MSD datasets demonstrate a substantial performance improvement coupled with a significant reduction in computational complexity. Code is available at: https://github.com/BheeshmSharma/RASALoRE-BMVC-2025/.
comment: Accepted in BMVC-2025
☆ RayFusion: Ray Fusion Enhanced Collaborative Visual Perception NeurIPS2025
Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at https://github.com/wangsh0111/RayFusion.
comment: Accepted by NeurIPS2025
☆ CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as ``black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.
☆ GraphEnet: Event-driven Human Pose Estimation with a Graph Neural Network
Human Pose Estimation is a crucial module in human-machine interaction applications and, especially since the rise in deep learning technology, robust methods are available to consumers using RGB cameras and commercial GPUs. On the other hand, event-based cameras have gained popularity in the vision research community for their low latency and low energy advantages that make them ideal for applications where those resources are constrained like portable electronics and mobile robots. In this work we propose a Graph Neural Network, GraphEnet, that leverages the sparse nature of event camera output, with an intermediate line based event representation, to estimate 2D Human Pose of a single person at a high frequency. The architecture incorporates a novel offset vector learning paradigm with confidence based pooling to estimate the human pose. This is the first work that applies Graph Neural Networks to event data for Human Pose Estimation. The code is open-source at https://github.com/event-driven-robotics/GraphEnet-NeVi-ICCV2025.
☆ Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN
The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model's size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.
comment: 7 pages, 4 figures
☆ The impact of abstract and object tags on image privacy classification
Object tags denote concrete entities and are central to many computer vision tasks, whereas abstract tags capture higher-level information, which is relevant for tasks that require a contextual, potentially subjective scene understanding. Object and abstract tags extracted from images also facilitate interpretability. In this paper, we explore which type of tags is more suitable for the context-dependent and inherently subjective task of image privacy. While object tags are generally used for privacy classification, we show that abstract tags are more effective when the tag budget is limited. Conversely, when a larger number of tags per image is available, object-related information is as useful. We believe that these findings will guide future research in developing more accurate image privacy classifiers, informed by the role of tag types and quantity.
comment: This work has been submitted to the ICASSP 2026
☆ Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement NeurIPS 2025
Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction.In Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency reconstruction.Stage Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent structure.At inference, a tunable parameter {\alpha} enables flexible fidelity-perception trade-offs.Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
comment: Accepted by NeurIPS 2025
☆ SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation ICME 2025
Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complementary to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths from probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.
comment: accepted by ICME 2025
☆ A Large-scale Dataset for Robust Complex Anime Scene Text Detection
Current text detection datasets primarily target natural or document scenes, where text typically appear in regular font and shapes, monotonous colors, and orderly layouts. The text usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scene also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. %Cross-dataset evaluations using state-of-the-art methods demonstrate that models trained on AnimeText achieve superior performance in anime text detection tasks compared to existing datasets. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: https://huggingface.co/datasets/deepghs/AnimeText
☆ CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.
☆ TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
comment: Project page: https://ttom-t2v.github.io/
☆ ASBench: Image Anomalies Synthesis Benchmark for Anomaly Detection
Anomaly detection plays a pivotal role in manufacturing quality control, yet its application is constrained by limited abnormal samples and high manual annotation costs. While anomaly synthesis offers a promising solution, existing studies predominantly treat anomaly synthesis as an auxiliary component within anomaly detection frameworks, lacking systematic evaluation of anomaly synthesis algorithms. Current research also overlook crucial factors specific to anomaly synthesis, such as decoupling its impact from detection, quantitative analysis of synthetic data and adaptability across different scenarios. To address these limitations, we propose ASBench, the first comprehensive benchmarking framework dedicated to evaluating anomaly synthesis methods. Our framework introduces four critical evaluation dimensions: (i) the generalization performance across different datasets and pipelines (ii) the ratio of synthetic to real data (iii) the correlation between intrinsic metrics of synthesis images and anomaly detection performance metrics , and (iv) strategies for hybrid anomaly synthesis methods. Through extensive experiments, ASBench not only reveals limitations in current anomaly synthesis methods but also provides actionable insights for future research directions in anomaly synthesis
☆ MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose \textbf{Memory-Augmented Reinforcement Learning-based Token Compression (MARC)}, which integrates structured retrieval and RL-based distillation. MARC adopts a \textit{retrieve-then-compress} strategy using a \textbf{Visual Memory Retriever (VMR)} to select key clips and a \textbf{Compression Group Relative Policy Optimization (C-GRPO)} framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens -- reducing visual tokens by \textbf{95\%}, GPU memory by \textbf{72\%}, and latency by \textbf{23.9\%}. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
☆ MMM: Quantum-Chemical Molecular Representation Learning for Combinatorial Drug Recommendation MICCAI
Drug recommendation is an essential task in machine learning-based clinical decision support systems. However, the risk of drug-drug interactions (DDI) between co-prescribed medications remains a significant challenge. Previous studies have used graph neural networks (GNNs) to represent drug structures. Regardless, their simplified discrete forms cannot fully capture the molecular binding affinity and reactivity. Therefore, we propose Multimodal DDI Prediction with Molecular Electron Localization Function (ELF) Maps (MMM), a novel framework that integrates three-dimensional (3D) quantum-chemical information into drug representation learning. It generates 3D electron density maps using the ELF. To capture both therapeutic relevance and interaction risks, MMM combines ELF-derived features that encode global electronic properties with a bipartite graph encoder that models local substructure interactions. This design enables learning complementary characteristics of drug molecules. We evaluate MMM in the MIMIC-III dataset (250 drugs, 442 substructures), comparing it with several baseline models. In particular, a comparison with the GNN-based SafeDrug model demonstrates statistically significant improvements in the F1-score (p = 0.0387), Jaccard (p = 0.0112), and the DDI rate (p = 0.0386). These results demonstrate the potential of ELF-based 3D representations to enhance prediction accuracy and support safer combinatorial drug prescribing in clinical practice.
comment: Medical Image Computing and Computer-Assisted Intervention (MICCAI) Predictive Intelligence in Medicine Workshop (MICCAI PRIME) 2025; 13 pages
☆ SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: https://github.com/dllgyufei/SatFusion.git.
☆ FlowLensing: Simulating Gravitational Lensing with Flow Matching
Gravitational lensing is one of the most powerful probes of dark matter, yet creating high-fidelity lensed images at scale remains a bottleneck. Existing tools rely on ray-tracing or forward-modeling pipelines that, while precise, are prohibitively slow. We introduce FlowLensing, a Diffusion Transformer-based compact and efficient flow-matching model for strong gravitational lensing simulation. FlowLensing operates in both discrete and continuous regimes, handling classes such as different dark matter models as well as continuous model parameters ensuring physical consistency. By enabling scalable simulations, our model can advance dark matter studies, specifically for probing dark matter substructure in cosmological surveys. We find that our model achieves a speedup of over 200$\times$ compared to classical simulators for intensive dark matter models, with high fidelity and low inference latency. FlowLensing enables rapid, scalable, and physically consistent image synthesis, offering a practical alternative to traditional forward-modeling pipelines.
comment: 6 pages, 2 figures, 3 tables
☆ Team Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception - Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track
In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent's ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.
☆ XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method
Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited while reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the $360^\circ$ panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose \textbf{XYZCylinder}, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner. Project page: \href{https://yuyuyu223.github.io/XYZCYlinder-projectpage/}{here}.
comment: Project page: https://yuyuyu223.github.io/XYZCYlinder-projectpage/
☆ Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials
High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.
☆ AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views
The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that for robust sparse-view reconstruction, semantic understanding instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views. Our code is avaliable at https://github.com/MediaX-SJTU/AlignGS .
☆ IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries
Sign language to spoken language audio translation is important to connect the hearing- and speech-challenged humans with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos with a sequence of possibly non-grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non-Maximal Suppression (NMS) algorithm for the temporal detection of signs in non-grammatic continuous sequences. Experimental results demonstrate competitive performance on ASL-Citizen-1500 and WLASL-100 datasets with Top-1 accuracies of 72.01\% and 78.67\%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: https://github.com/BheeshmSharma/IsoSignVid2Aud_AIMLsystems-2025.
comment: Accepted in AIML-Systems-2025
☆ PrismGS: Physically-Grounded Anti-Aliasing for High-Fidelity Large-Scale 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has recently enabled real-time photorealistic rendering in compact scenes, but scaling to large urban environments introduces severe aliasing artifacts and optimization instability, especially under high-resolution (e.g., 4K) rendering. These artifacts, manifesting as flickering textures and jagged edges, arise from the mismatch between Gaussian primitives and the multi-scale nature of urban geometry. While existing ``divide-and-conquer'' pipelines address scalability, they fail to resolve this fidelity gap. In this paper, we propose PrismGS, a physically-grounded regularization framework that improves the intrinsic rendering behavior of 3D Gaussians. PrismGS integrates two synergistic regularizers. The first is pyramidal multi-scale supervision, which enforces consistency by supervising the rendering against a pre-filtered image pyramid. This compels the model to learn an inherently anti-aliased representation that remains coherent across different viewing scales, directly mitigating flickering textures. This is complemented by an explicit size regularization that imposes a physically-grounded lower bound on the dimensions of the 3D Gaussians. This prevents the formation of degenerate, view-dependent primitives, leading to more stable and plausible geometric surfaces and reducing jagged edges. Our method is plug-and-play and compatible with existing pipelines. Extensive experiments on MatrixCity, Mill-19, and UrbanScene3D demonstrate that PrismGS achieves state-of-the-art performance, yielding significant PSNR gains around 1.5 dB against CityGaussian, while maintaining its superior quality and robustness under demanding 4K rendering.
☆ MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.
☆ Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation NeurIPS2025
Visual prompting (VP) has emerged as a promising parameter-efficient fine-tuning approach for adapting pre-trained vision models to downstream tasks without modifying model parameters. Despite offering advantages like negligible computational overhead and compatibility with black-box models, conventional VP methods typically achieve lower accuracy than other adaptation approaches. Our analysis reveals two critical limitations: the restricted expressivity of simple additive transformation and a tendency toward overfitting when the parameter count increases. To address these challenges, we propose ACAVP (Affine, Color, and Additive Visual Prompting), which enhances VP's expressive power by introducing complementary transformation operations: affine transformation for creating task-specific prompt regions while preserving original image information, and color transformation for emphasizing task-relevant visual features. Additionally, we identify that overfitting is a critical issue in VP training and introduce TrivialAugment as an effective data augmentation, which not only benefits our approach but also significantly improves existing VP methods, with performance gains of up to 12 percentage points on certain datasets. This demonstrates that appropriate data augmentation is universally beneficial for VP training. Extensive experiments across twelve diverse image classification datasets with two different model architectures demonstrate that ACAVP achieves state-of-the-art accuracy among VP methods, surpasses linear probing in average accuracy, and exhibits superior robustness to distribution shifts, all while maintaining minimal computational overhead during inference.
comment: Accepted to NeurIPS2025
☆ An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images
Predicting spherical pixel depth from monocular $360^{\circ}$ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, causing oversmoothed room corners and noise sensitivity. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates those information into the depth estimation process through background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The proposed room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder's output to generate the corresponding background depth map. Then, a background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder's predictions. Extensive experimental results on the Stanford2D3D, Matterport3D and Structured3D datasets show that our proposed methods can achieve significantly superior performance than current open-source methods. Our code is available at https://github.com/emiyaning/RGCNet.
☆ FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition
Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.
☆ GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Recently spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (eg. a map), thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
comment: 20 pages, 13 figures
☆ Demystifying Deep Learning-based Brain Tumor Segmentation with 3D UNets and Explainable AI (XAI): A Comparative Analysis
The current study investigated the use of Explainable Artificial Intelligence (XAI) to improve the accuracy of brain tumor segmentation in MRI images, with the goal of assisting physicians in clinical decision-making. The study focused on applying UNet models for brain tumor segmentation and using the XAI techniques of Gradient-weighted Class Activation Mapping (Grad-CAM) and attention-based visualization to enhance the understanding of these models. Three deep learning models - UNet, Residual UNet (ResUNet), and Attention UNet (AttUNet) - were evaluated to identify the best-performing model. XAI was employed with the aims of clarifying model decisions and increasing physicians' trust in these models. We compared the performance of two UNet variants (ResUNet and AttUNet) with the conventional UNet in segmenting brain tumors from the BraTS2020 public dataset and analyzed model predictions with Grad-CAM and attention-based visualization. Using the latest computer hardware, we trained and validated each model using the Adam optimizer and assessed their performance with respect to: (i) training, validation, and inference times, (ii) segmentation similarity coefficients and loss functions, and (iii) classification performance. Notably, during the final testing phase, ResUNet outperformed the other models with respect to Dice and Jaccard similarity scores, as well as accuracy, recall, and F1 scores. Grad-CAM provided visuospatial insights into the tumor subregions each UNet model focused on while attention-based visualization provided valuable insights into the working mechanisms of AttUNet's attention modules. These results demonstrated ResUNet as the best-performing model and we conclude by recommending its use for automated brain tumor segmentation in future clinical assessments. Our source code and checkpoint are available at https://github.com/ethanong98/MultiModel-XAI-Brats2020
☆ IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction
Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose \textbf{IntentionVLA}, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms $\pi_0$, achieving 18\% higher success rates with direct instructions and 28\% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40\% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.
☆ DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream
Reconstructing Dynamic 3D Gaussian Splatting (3DGS) from low-framerate RGB videos is challenging. This is because large inter-frame motions will increase the uncertainty of the solution space. For example, one pixel in the first frame might have more choices to reach the corresponding pixel in the second frame. Event cameras can asynchronously capture rapid visual changes and are robust to motion blur, but they do not provide color information. Intuitively, the event stream can provide deterministic constraints for the inter-frame large motion by the event trajectories. Hence, combining low-temporal-resolution images with high-framerate event streams can address this challenge. However, it is challenging to jointly optimize Dynamic 3DGS using both RGB and event modalities due to the significant discrepancy between these two data modalities. This paper introduces a novel framework that jointly optimizes dynamic 3DGS from the two modalities. The key idea is to adopt event motion priors to guide the optimization of the deformation fields. First, we extract the motion priors encoded in event streams by using the proposed LoCM unsupervised fine-tuning framework to adapt an event flow estimator to a certain unseen scene. Then, we present the geometry-aware data association method to build the event-Gaussian motion correspondence, which is the primary foundation of the pipeline, accompanied by two useful strategies, namely motion decomposition and inter-frame pseudo-label. Extensive experiments show that our method outperforms existing image and event-based approaches across synthetic and real scenes and prove that our method can effectively optimize dynamic 3DGS with the help of event data.
comment: Accepted by TVCG
☆ UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes
Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. In comparison to the RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at https://srameo.github.io/projects/ultraled.
☆ ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360 degrees reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.
☆ SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction NIPS 2025
Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and have difficulty in handling difficult human poses and reconstructing fine details. In this paper, we propose SyncHuman, a novel framework that combines 2D multiview generative model and 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. Multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.
comment: NIPS 2025
☆ RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning
In web data, product images are central to boosting user engagement and advertising efficacy on e-commerce platforms, yet the intrusive elements such as watermarks and promotional text remain major obstacles to delivering clear and appealing product visuals. Although diffusion-based inpainting methods have advanced, they still face challenges in commercial settings due to unreliable object removal and limited domain-specific adaptation. To tackle these challenges, we propose Repainter, a reinforcement learning framework that integrates spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO). Our approach modulates attention mechanisms to emphasize background context, generating higher-reward samples and reducing unwanted object insertion. We also introduce a composite reward mechanism that balances global, local, and semantic constraints, effectively reducing visual artifacts and reward hacking. Additionally, we contribute EcomPaint-100K, a high-quality, large-scale e-commerce inpainting dataset, and a standardized benchmark EcomPaint-Bench for fair evaluation. Extensive experiments demonstrate that Repainter significantly outperforms state-of-the-art methods, especially in challenging scenes with intricate compositions. We will release our code and weights upon acceptance.
☆ Mutual Learning for Hashing: Unlocking Strong Hash Functions from Weak Supervision
Deep hashing has been widely adopted for large-scale image retrieval, with numerous strategies proposed to optimize hash function learning. Pairwise-based methods are effective in learning hash functions that preserve local similarity relationships, whereas center-based methods typically achieve superior performance by more effectively capturing global data distributions. However, the strength of center-based methods in modeling global structures often comes at the expense of underutilizing important local similarity information. To address this limitation, we propose Mutual Learning for Hashing (MLH), a novel weak-to-strong framework that enhances a center-based hashing branch by transferring knowledge from a weaker pairwise-based branch. MLH consists of two branches: a strong center-based branch and a weaker pairwise-based branch. Through an iterative mutual learning process, the center-based branch leverages local similarity cues learned by the pairwise-based branch. Furthermore, inspired by the mixture-of-experts paradigm, we introduce a novel mixture-of-hash-experts module that enables effective cross-branch interaction, further enhancing the performance of both branches. Extensive experiments demonstrate that MLH consistently outperforms state-of-the-art hashing methods across multiple benchmark datasets.
☆ Hybrid CNN-BYOL Approach for Fault Detection in Induction Motors Using Thermal Images
Induction motors (IMs) are indispensable in industrial and daily life, but they are susceptible to various faults that can lead to overheating, wasted energy consumption, and service failure. Early detection of faults is essential to protect the motor and prolong its lifespan. This paper presents a hybrid method that integrates BYOL with CNNs for classifying thermal images of induction motors for fault detection. The thermal dataset used in this work includes different operating states of the motor, such as normal operation, overload, and faults. We employed multiple deep learning (DL) models for the BYOL technique, ranging from popular architectures such as ResNet-50, DenseNet-121, DenseNet-169, EfficientNetB0, VGG16, and MobileNetV2. Additionally, we introduced a new high-performance yet lightweight CNN model named BYOL-IMNet, which comprises four custom-designed blocks tailored for fault classification in thermal images. Our experimental results demonstrate that the proposed BYOL-IMNet achieves 99.89\% test accuracy and an inference time of 5.7 ms per image, outperforming state-of-the-art models. This study highlights the promising performance of the CNN-BYOL hybrid method in enhancing accuracy for detecting faults in induction motors, offering a robust methodology for online monitoring in industrial settings.
☆ Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs
This study evaluates whether integrating curriculum learning with diffusion-based synthetic augmentation can enhance the detection of difficult pulmonary nodules in chest radiographs, particularly those with low size, brightness, and contrast, which often challenge conventional AI models due to data imbalance and limited annotation. A Faster R-CNN with a Feature Pyramid Network (FPN) backbone was trained on a hybrid dataset comprising expert-labeled NODE21 (1,213 patients; 52.4 percent male; mean age 63.2 +/- 11.5 years), VinDr-CXR, CheXpert, and 11,206 DDPM-generated synthetic images. Difficulty scores based on size, brightness, and contrast guided curriculum learning. Performance was compared to a non-curriculum baseline using mean average precision (mAP), Dice score, and area under the curve (AUC). Statistical tests included bootstrapped confidence intervals, DeLong tests, and paired t-tests. The curriculum model achieved a mean AUC of 0.95 versus 0.89 for the baseline (p < 0.001), with improvements in sensitivity (70 percent vs. 48 percent) and accuracy (82 percent vs. 70 percent). Stratified analysis demonstrated consistent gains across all difficulty bins (Easy to Very Hard). Grad-CAM visualizations confirmed more anatomically focused attention under curriculum learning. These results suggest that curriculum-guided synthetic augmentation enhances model robustness and generalization for pulmonary nodule detection.
comment: 32 pages, 6 figures,
☆ Controllable Video Synthesis via Variational Inference
Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.
comment: Project page: https://video-synthesis-variational.github.io/
☆ TCIP: Threshold-Controlled Iterative Pyramid Network for Deformable Medical Image Registration
Although pyramid networks have demonstrated superior performance in deformable medical image registration, their decoder architectures are inherently prone to propagating and accumulating anatomical structure misalignments. Moreover, most existing models do not adaptively determine the number of iterations for optimization under varying deformation requirements across images, resulting in either premature termination or excessive iterations that degrades registration accuracy. To effectively mitigate the accumulation of anatomical misalignments, we propose the Feature-Enhanced Residual Module (FERM) as the core component of each decoding layer in the pyramid network. FERM comprises three sequential blocks that extract anatomical semantic features, learn to suppress irrelevant features, and estimate the final deformation field, respectively. To adaptively determine the number of iterations for varying images, we propose the dual-stage Threshold-Controlled Iterative (TCI) strategy. In the first stage, TCI assesses registration stability and with asserted stability, it continues with the second stage to evaluate convergence. We coin the model that integrates FERM and TCI as Threshold-Controlled Iterative Pyramid (TCIP). Extensive experiments on three public brain MRI datasets and one abdomen CT dataset demonstrate that TCIP outperforms the state-of-the-art (SOTA) registration networks in terms of accuracy, while maintaining comparable inference speed and a compact model parameter size. Finally, we assess the generalizability of FERM and TCI by integrating them with existing registration networks and further conduct ablation studies to validate the effectiveness of these two proposed methods.
☆ Automatic Text Box Placement for Supporting Typographic Design
In layout design for advertisements and web pages, balancing visual appeal and communication efficiency is crucial. This study examines automated text box placement in incomplete layouts, comparing a standard Transformer-based method, a small Vision and Language Model (Phi3.5-vision), a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images. Evaluations on the Crello dataset show the standard Transformer-based models generally outperform VLM-based approaches, particularly when incorporating richer appearance information. However, all methods face challenges with very small text or densely populated layouts. These findings highlight the benefits of task-specific architectures and suggest avenues for further improvement in automated layout design.
☆ MONKEY: Masking ON KEY-Value Activation Adapter for Personalization
Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often suffer somewhat when they end up just recreating the subject image, and ignoring the text prompt. We observe that one popular method for personalization, the IP-Adapter automatically generates masks that we definitively segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test time personalization methods, and find our method displays high prompt and source image alignment.
☆ Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection
Video virtual try-on aims to replace the clothing of a person in a video with a target garment. Current dual-branch architectures have achieved significant success in diffusion models based on the U-Net; however, adapting them to diffusion models built upon the Diffusion Transformer remains challenging. Initially, introducing latent space features from the garment reference branch requires adding or modifying the backbone network, leading to a large number of trainable parameters. Subsequently, the latent space features of garments lack inherent temporal characteristics and thus require additional learning. To address these challenges, we propose a novel approach, OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement: specifically, we employ an image-based clothing transfer model to replace the clothing in the initial frame, and then, under the content control of the edited first frame, utilize pose and mask information to guide the temporal prior of the video generation model in synthesizing the remaining frames sequentially. Experiments show that our method achieves superior parameter efficiency and computational efficiency while still maintaining leading performance under these constraints.
comment: 5 pages (including references), 4 figures. Code and models will be released upon publication
☆ Dual-Stream Alignment for Action Segmentation
Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProcel. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing
comment: Journal Submission
☆ PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment ICIP 2025
Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in absence of a reference. We begin with the observation that different modalities of data - text descriptions, 2D projections, and 3D point cloud views - provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at https://www.github.com/shngt/pit-qmm.
comment: Oral presentation at ICIP 2025
☆ Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
☆ Rectified-CFG++ for Flow Based Models NeurIPS 2025
Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: https://rectified-cfgpp.github.io/
comment: Accepted at NeurIPS 2025
☆ Team Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception -- Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track
In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent's ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.
☆ Sparse components distinguish visual pathways & their alignment to neural networks
The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams -- identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system's underlying axes of neural tuning.
☆ FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation
Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK conducted experiments on the ScanNet200 and Replica datasets, achieving state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All codes will be released after the paper is accepted.
☆ The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective NeurIPS 2025
As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlooked prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
comment: Accepted at NeurIPS 2025
☆ Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction
Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
☆ D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
comment: This paper has been accepted to EMNLP 2025
☆ SkipSR: Faster Super Resolution with Token Skipping
Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
comment: 14 pages, 7 figures
☆ Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering CVPR2025
Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods like contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces the hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019.
comment: CVPR2025 Paper
☆ Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization
Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision--language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.
☆ Re-Identifying Kākā with AI-Automated Video Key Frame Extraction
Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of k\={a}k\={a} (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification, however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in k\={a}k\={a} re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising k\={a}k\={a} individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.
☆ LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental, training instability that causes catastrophic model divergence using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our "precision-over-volume" principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
comment: 19 pages, 9 figures, 6 tables
☆ Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform
This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.
comment: 6 pages
☆ SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense
Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention guided non-homogeneous sparse sampling and predictive coding plays a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.
☆ BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
☆ Structured Output Regularization: a framework for few-shot transfer learning
Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model's ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while using a combination of group lasso and $L_1$ penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or various blocks in neural networks enabling broad applicability for transfer learning tasks. We evaluate SOR on three few shot medical imaging classification tasks and we achieve competitive results using DenseNet121, and EfficientNetB4 bases compared to established benchmarks.
☆ Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.
comment: 18 pages, 11 figures, code: https://github.com/F1y1113/UniWM
☆ Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
comment: Project Page: https://kangliao929.github.io/projects/puffin/
☆ FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching
The application of diffusion transformers is suffering from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in future timesteps. However, previous feature caching assumes that features in adjacent timesteps are similar or continuous, which does not always hold in all settings. To investigate this, this paper begins with an analysis from the frequency domain, which reveal that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. Concretely, low-frequency components, which decide the structure of images, exhibit higher similarity but poor continuity. In contrast, the high-frequency bands, which decode the details of images, show significant continuity but poor similarity. These interesting observations motivate us to propose Frequency-aware Caching (FreqCa) which directly reuses features of low-frequency components based on their similarity, while using a second-order Hermite interpolator to predict the volatile high-frequency ones based on its continuity. Besides, we further propose to cache Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%. Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate its effectiveness in both generation and editing. Codes are available in the supplementary materials and will be released on GitHub.
comment: 15 pages, 11 figures
☆ Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released on \href{https://github.com/ZJUI-AI4H/Hulu-Med}{https://github.com/ZJUI-AI4H/Hulu-Med}.
☆ A 3D Generation Framework from Cross Modality to Parameterized Primitive
Recent advancements in AI-driven 3D model generation have leveraged cross modality, yet generating models with smooth surfaces and minimizing storage overhead remain challenges. This paper introduces a novel multi-stage framework for generating 3D models composed of parameterized primitives, guided by textual and image inputs. In the framework, A model generation algorithm based on parameterized primitives, is proposed, which can identifies the shape features of the model constituent elements, and replace the elements with parameterized primitives with high quality surface. In addition, a corresponding model storage method is proposed, it can ensure the original surface quality of the model, while retaining only the parameters of parameterized primitives. Experiments on virtual scene dataset and real scene dataset demonstrate the effectiveness of our method, achieving a Chamfer Distance of 0.003092, a VIoU of 0.545, a F1-Score of 0.9139 and a NC of 0.8369, with primitive parameter files approximately 6KB in size. Our approach is particularly suitable for rapid prototyping of simple models.
♻ ☆ TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
comment: 9 pages, 6 figures
♻ ☆ Human Action Recognition from Point Clouds over Time
Recent research into human action recognition (HAR) has focused predominantly on skeletal action recognition and video-based methods. With the increasing availability of consumer-grade depth sensors and Lidar instruments, there is a growing opportunity to leverage dense 3D data for action recognition, to develop a third way. This paper presents a novel approach for recognizing actions from 3D videos by introducing a pipeline that segments human point clouds from the background of a scene, tracks individuals over time, and performs body part segmentation. The method supports point clouds from both depth sensors and monocular depth estimation. At the core of the proposed HAR framework is a novel backbone for 3D action recognition, which combines point-based techniques with sparse convolutional networks applied to voxel-mapped point cloud sequences. Experiments incorporate auxiliary point features including surface normals, color, infrared intensity, and body part parsing labels, to enhance recognition accuracy. Evaluation on the NTU RGB- D 120 dataset demonstrates that the method is competitive with existing skeletal action recognition algorithms. Moreover, combining both sensor-based and estimated depth inputs in an ensemble setup, this approach achieves 89.3% accuracy when different human subjects are considered for training and testing, outperforming previous point cloud action recognition methods.
♻ ☆ SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the \textit{Reasoning Tax}. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance $70.13$ and $78.97$ on safety and helpfulness across six benchmarks, surpassing both same-scale and $>10\times$ larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by \num{6.47} and \num{16.76} points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.
♻ ☆ Explaining raw data complexity to improve satellite onboard processing
With increasing processing power, deploying AI models for remote sensing directly onboard satellites is becoming feasible. However, new constraints arise, mainly when using raw, unprocessed sensor data instead of preprocessed ground-based products. While current solutions primarily rely on preprocessed sensor images, few approaches directly leverage raw data. This study investigates the effects of utilising raw data on deep learning models for object detection and classification tasks. We introduce a simulation workflow to generate raw-like products from high-resolution L1 imagery, enabling systemic evaluation. Two object detection models (YOLOv11n and YOLOX-S) are trained on both raw and L1 datasets, and their performance is compared using standard detection metrics and explainability tools. Results indicate that while both models perform similarly at low to medium confidence thresholds, the model trained on raw data struggles with object boundary identification at high confidence levels. It suggests that adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.
comment: Preprint: European Data Handling & Data Processing Conference (EDHPC) 2025
♻ ☆ Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
comment: The 1st version
♻ ☆ Play to Generalize: Learning to Reason Through Game Play
Developing reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by literature suggesting that gameplay promotes transferable reasoning skills, we propose a novel post-training method, Visual Game Learning (ViGaL), where MLLMs develop generalizable reasoning skills through playing arcade-like games. Specifically, we show that training a 7B-parameter MLLM via reinforcement learning (RL) on simple games like Snake significantly enhances the downstream performance on multimodal math benchmarks like MathVista, on multi-discipline questions like MMMU and on 3D spatial reasoning benchmarks like VSI-Bench, without seeing any worked solutions, equations, or diagrams during RL. Remarkably, our model outperforms specialist models post-trained on benchmark-oriented multimodal reasoning data, while preserving the model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest that multimodal reasoning can emerge from gameplay, pointing to a promising strategy of designing surrogate tasks for RL post-training.
comment: Project Page: https://yunfeixie233.github.io/ViGaL/
♻ ☆ IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout
Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, as reliable control over object categories, counts, and spatial layout is difficult to achieve. For that, we first study quantity and layout consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. Then, we present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony aware (HA) module that fuses perception semantics while modeling object counts and locations, resulting in accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision and language matching, thereby further improving generation stability and layout consistency in multiple object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy, utilizing only 200 training images and 10.6M of trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
♻ ☆ Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
comment: dataset link: https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
♻ ☆ ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Queries are grounded in satellite or aerial imagery, including both optical RGB and SAR data, and require agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 486 structured agentic tasks with 1,773 expert-verified reasoning steps. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing.
♻ ☆ Paper2Video: Automatic Video Generation from Scientific Papers
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel effective tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
comment: Project Page: https://showlab.github.io/Paper2Video/
♻ ☆ Deblurring in the Wild: A Real-World Image Deblurring Dataset from Smartphone High-Speed Videos
We introduce the largest real-world image deblurring dataset constructed from smartphone slow-motion videos. Using 240 frames captured over one second, we simulate realistic long-exposure blur by averaging frames to produce blurry images, while using the temporally centered frame as the sharp reference. Our dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets, with 8 times the amount of different scenes, including indoor and outdoor environments, with varying object and camera motions. We benchmark multiple state-of-the-art (SOTA) deblurring models on our dataset and observe significant performance degradation, highlighting the complexity and diversity of our benchmark. Our dataset serves as a challenging new benchmark to facilitate robust and generalizable deblurring models.
comment: 8 pages (without references), 3 figures. Dataset https://huggingface.co/datasets/masterda/SloMoBlur
♻ ☆ Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective NeurIPS2024
State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segemenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields compact representation for image embeddings as segmentation masks. Experiments conducted on dataset ADE20K find that DEPICT consistently outperforms its black-box counterpart, Segmenter, and it is light weight and more robust.
comment: NeurIPS2024. Code:https://github.com/QishuaiWen/DEPICT/
♻ ☆ Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
♻ ☆ ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification ICCV 2025
Bone health studies are crucial in medical practice for the early detection and treatment of Osteopenia and Osteoporosis. Clinicians usually make a diagnosis based on densitometry (DEXA scans) and patient history. The applications of AI in this field are ongoing research. Most successful methods rely on deep learning models that use vision alone (DEXA/X-ray imagery) and focus on prediction accuracy, while explainability is often disregarded and left to post hoc assessments of input contributions. We propose ProtoMedX, a multi-modal (multimodal) model that uses both DEXA scans of the lumbar spine and patient records. ProtoMedX's prototype-based architecture is explainable by design, which is crucial for medical applications, especially in the context of the upcoming EU AI Act, as it allows explicit analysis of model decisions, including incorrect ones. ProtoMedX demonstrates state-of-the-art performance in bone health classification while also providing explanations that can be visually understood by clinicians. Using a dataset of 4,160 real NHS patients, the proposed ProtoMedX achieves 87.58% accuracy in vision-only tasks and 89.8% in its multi-modal variant, both surpassing existing published methods.
comment: ICCV 2025 (PHAROS-AFE-AIMI: Adaptation, Fairness, and Explainability in Medical Imaging). 8 pages, 5 figures, 4 tables. Keywords: multi-modal, multimodal, prototype learning, explainable AI, interpretable models, case-based reasoning, medical imaging, DEXA, bone health, osteoporosis, osteopenia, diagnosis, classification, clustering
♻ ☆ MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and normalization schemes of reconstruction targets for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we introduce MAESTRO, a novel adaptation of the Masked Autoencoder with optimized fusion mechanisms and a normalization scheme that incorporates a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets in both intra- and cross-dataset settings, MAESTRO achieves state-of-the-art performance on tasks that strongly rely on multitemporal dynamics, while also remaining competitive on others. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
♻ ☆ Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO ECCV 2024
Smart glasses are rapidly gaining advanced functions thanks to cutting-edge computing technologies, especially accelerated hardware architectures, and tiny Artificial Intelligence (AI) algorithms. However, integrating AI into smart glasses featuring a small form factor and limited battery capacity remains challenging for a satisfactory user experience. To this end, this paper proposes the design of a smart glasses platform for always-on on-device object detection with an all-day battery lifetime. The proposed platform is based on GAP9, a novel multi-core RISC-V processor from Greenwaves Technologies. Additionally, a family of sub-million parameter TinyissimoYOLO networks are proposed. They are benchmarked on established datasets, capable of differentiating up to 80 classes on MS-COCO. Evaluations on the smart glasses prototype demonstrate TinyissimoYOLO's inference latency of only 17ms and consuming 1.59mJ energy per inference. An end-to-end latency of 56ms is achieved which is equivalent to 18 frames per seconds (FPS) with a total power consumption of 62.9mW. This ensures continuous system runtime of up to 9.3 hours on a 154mAh battery. These results outperform MCUNet (TinyNAS+TinyEngine), which runs a simpler task (image classification) at just 7.3 FPS, while the 18 FPS achieved in this paper even include image-capturing, network inference, and detection post-processing. The algorithm's code is released open with this paper and can be found here: https://github.com/ETH-PBL/TinyissimoYOLO
comment: This paper has been accepted for publication at ECCV 2024 Workshops, Milan, 2024
♻ ☆ H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging
3D landmark detection is a critical task in medical image analysis, and accurately detecting anatomical landmarks is essential for subsequent medical imaging tasks. However, mainstream deep learning methods in this field struggle to simultaneously capture fine-grained local features and model global spatial relationships, while maintaining a balance between accuracy and computational efficiency. Local feature extraction requires capturing fine-grained anatomical details, while global modeling requires understanding the spatial relationships within complex anatomical structures. The high-dimensional nature of 3D volume further exacerbates these challenges, as landmarks are sparsely distributed, leading to significant computational costs. Therefore, achieving efficient and precise 3D landmark detection remains a pressing challenge in medical image analysis. In this work, We propose a \textbf{H}ybrid \textbf{3}D \textbf{DE}tection \textbf{Net}(H3DE-Net), a novel framework that combines CNNs for local feature extraction with a lightweight attention mechanism designed to efficiently capture global dependencies in 3D volumetric data. This mechanism employs a hierarchical routing strategy to reduce computational cost while maintaining global context modeling. To our knowledge, H3DE-Net is the first 3D landmark detection model that integrates such a lightweight attention mechanism with CNNs. Additionally, integrating multi-scale feature fusion further enhances detection accuracy and robustness. Experimental results on a public CT dataset demonstrate that H3DE-Net achieves state-of-the-art(SOTA) performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations. We aready open-source our project, including code, data and model weights.
♻ ☆ MonoGSDF: Exploring Monocular Geometric Cues for Gaussian Splatting-Guided Implicit Surface Reconstruction
Accurate meshing from monocular images remains a key challenge in 3D vision. While state-of-the-art 3D Gaussian Splatting (3DGS) methods excel at synthesizing photorealistic novel views through rasterization-based rendering, their reliance on sparse, explicit primitives severely limits their ability to recover watertight and topologically consistent 3D surfaces.We introduce MonoGSDF, a novel method that couples Gaussian-based primitives with a neural Signed Distance Field (SDF) for high-quality reconstruction. During training, the SDF guides Gaussians' spatial distribution, while at inference, Gaussians serve as priors to reconstruct surfaces, eliminating the need for memory-intensive Marching Cubes. To handle arbitrary-scale scenes, we propose a scaling strategy for robust generalization. A multi-resolution training scheme further refines details and monocular geometric cues from off-the-shelf estimators enhance reconstruction quality. Experiments on real-world datasets show MonoGSDF outperforms prior methods while maintaining efficiency.
♻ ☆ Towards Methane Detection Onboard Satellites
Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using \textit{unorthorectified} data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS , along with code at https://github.com/spaceml-org/plume-hunter.
♻ ☆ Feedback Guidance of Diffusion Models NeurIPS25
While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG's implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive to Limited Interval Guidance (LIG) while benefitting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
comment: Article accepeted as poster at the 39th Annual Conference on Neural Information Processing Systems (NeurIPS25). Code is available at: https://github.com/FelixKoulischer/FBG_using_edm2
♻ ☆ Prompt-guided Representation Disentanglement for Action Recognition
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found in https://github.com/iamsnaping/ProDA.git
♻ ☆ VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones
Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisonTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in GIFT-Eval benchmark which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
comment: 19 pages
♻ ☆ DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images NeurIPS 2025
Numerous models have been developed for scanpath and saliency prediction, which are typically trained on scanpaths, which model eye movement as a sequence of discrete fixation points connected by saccades, while the rich information contained in the raw trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, namely Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns, despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, resulting in outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images using a diffusion model while fully leveraging the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories. Project webpage: https://diff-eye.github.io/
comment: Accepted to NeurIPS 2025
♻ ☆ GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering
Foveated rendering significantly reduces computational demands in virtual reality applications by concentrating rendering quality where users focus their gaze. Current approaches require expensive hardware-based eye tracking systems, limiting widespread adoption due to cost, calibration complexity, and hardware compatibility constraints. This paper presents GazeProphet, a software-only approach for predicting gaze locations in VR environments without requiring dedicated eye tracking hardware. The approach combines a Spherical Vision Transformer for processing 360-degree VR scenes with an LSTM-based temporal encoder that captures gaze sequence patterns. A multi-modal fusion network integrates spatial scene features with temporal gaze dynamics to predict future gaze locations with associated confidence estimates. Experimental evaluation on a comprehensive VR dataset demonstrates that GazeProphet achieves a median angular error of 3.83 degrees, outperforming traditional saliency-based baselines by 24% while providing reliable confidence calibration. The approach maintains consistent performance across different spatial regions and scene types, enabling practical deployment in VR systems without additional hardware requirements. Statistical analysis confirms the significance of improvements across all evaluation metrics. These results show that software-only gaze prediction can work for VR foveated rendering, making this performance boost more accessible to different VR platforms and apps.
comment: 8 pages, 3 figures
♻ ☆ Language learning shapes visual category-selectivity in deep neural networks
Category-selective regions in the human brain-such as the fusiform face area (FFA), extrastriate body area (EBA), parahippocampal place area (PPA), and visual word form area (VWFA)-support high-level visual recognition. Here, we investigate whether artificial neural networks (ANNs) exhibit analogous category-selective neurons and how these representations are shaped by language experience. Using an fMRI-inspired functional localizer approach, we identified face-, body-, place-, and word-selective neurons in deep networks presented with category images and scrambled controls. Both the purely visual ResNet and a linguistically supervised Lang-Learned ResNet contained category-selective neurons that increased in proportion across layers. However, compared to the vision-only model, the Lang-Learned ResNet showed a greater number but lower specificity of category-selective neurons, along with reduced spatial localization and attenuated activation strength-indicating a shift toward more distributed, semantically aligned coding. These effects were replicated in the large-scale vision-language model CLIP. Together, our findings reveal that language experience systematically reorganizes visual category representations in ANNs, providing a computational parallel to how linguistic context may shape categorical organization in the human brain.
♻ ☆ Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLMs in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
♻ ☆ LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
comment: LLaVA-OneVision-1.5 Technical Report
♻ ☆ Think With Videos For Agentic Long-Video Understanding
Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).
♻ ☆ Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks NeurIPS 2025
Spiking neural networks (SNNs) are emerging as a promising alternative to traditional artificial neural networks (ANNs), offering biological plausibility and energy efficiency. Despite these merits, SNNs are frequently hampered by limited capacity and insufficient representation power, yet remain underexplored in remote sensing super-resolution (SR) tasks. In this paper, we first observe that spiking signals exhibit drastic intensity variations across diverse textures, highlighting an active learning state of the neurons. This observation motivates us to apply SNNs for efficient SR of RSIs. Inspired by the success of attention mechanisms in representing salient information, we devise the spiking attention block (SAB), a concise yet effective component that optimizes membrane potentials through inferred attention weights, which, in turn, regulates spiking activity for superior feature representation. Our key contributions include: 1) we bridge the independent modulation between temporal and channel dimensions, facilitating joint feature correlation learning, and 2) we access the global self-similar patterns in large-scale remote sensing imagery to infer spatial attention weights, incorporating effective priors for realistic and faithful reconstruction. Building upon SAB, we proposed SpikeSR, which achieves state-of-the-art performance across various remote sensing benchmarks such as AID, DOTA, and DIOR, while maintaining high computational efficiency. Code of SpikeSR will be available at https://github.com/XY-boy/SpikeSR.
comment: Accepted by NeurIPS 2025
♻ ☆ DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs.\ 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models. Code and Model: https://github.com/aim-uofa/Diception
comment: Homepage: https://aim-uofa.github.io/Diception, Code and Model: https://github.com/aim-uofa/Diception
♻ ☆ EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval
Text-to-image retrieval is a critical task for managing diverse visual content, but common benchmarks for the task rely on small, single-domain datasets that fail to capture real-world complexity. Pre-trained vision-language models tend to perform well with easy negatives but struggle with hard negatives--visually similar yet incorrect images--especially in open-domain scenarios. To address this, we introduce Episodic Few-Shot Adaptation (EFSA), a novel test-time framework that adapts pre-trained models dynamically to a query's domain by fine-tuning on top-k retrieved candidates and synthetic captions generated for them. EFSA improves performance across diverse domains while preserving generalization, as shown in evaluations on queries from eight highly distinct visual domains and an open-domain retrieval pool of over one million images. Our work highlights the potential of episodic few-shot adaptation to enhance robustness in the critical and understudied task of open-domain text-to-image retrieval.
♻ ☆ RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation ICCV 2025
Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and clothing semantic absence in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusing. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
comment: Accept by ICCV 2025 (Highlight). Project website: https://colorful-liyu.github.io/RAGDiffusion-page/
♻ ☆ FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification NeurIPS 2025
Medical image classification requires not only high predictive performance but also interpretability to ensure clinical trust and adoption. Graph Neural Networks (GNNs) offer a powerful framework for modeling relational structures within datasets; however, standard GNNs often operate as black boxes, limiting transparency and usability, particularly in clinical settings. In this work, we present an interpretable graph-based learning framework named FireGNN that integrates trainable fuzzy rules into GNNs for medical image classification. These rules embed topological descriptors - node degree, clustering coefficient, and label agreement - using learnable thresholds and sharpness parameters to enable intrinsic symbolic reasoning. Additionally, we explore auxiliary self-supervised tasks (e.g., homophily prediction, similarity entropy) as a benchmark to evaluate the contribution of topological learning. Our fuzzy-rule-enhanced model achieves strong performance across five MedMNIST benchmarks and the synthetic dataset MorphoMNIST, while also generating interpretable rule-based explanations. To our knowledge, this is the first integration of trainable fuzzy rules within a GNN. Source Code: https://github.com/basiralab/FireGNN
comment: Accepted at NeurIPS 2025 Conference (Workshop Track), San Diego, USA
♻ ☆ Maintaining Performance with Less Data
We propose a novel method for training a neural network for image classification to reduce input data dynamically, in order to reduce the costs of training a neural network model. As Deep Learning tasks become more popular, their computational complexity increases, leading to more intricate algorithms and models which have longer runtimes and require more input data. The result is a greater cost on time, hardware, and environmental resources. By using data reduction techniques, we reduce the amount of work performed, and therefore the environmental impact of AI techniques, and with dynamic data reduction we show that accuracy may be maintained while reducing runtime by up to 50%, and reducing carbon emission proportionally.
comment: 12 pages, 8 figures, 11 tables
♻ ☆ Motion Capture from Inertial and Vision Sensors
Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) large scale of over five million frames and 400 minutes duration; 2) multi-modality data of IMUs signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS dataset, we propose a SparseNet framework to capture human motion from IMUs and videos by discovering their supplementary features and exploring the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experiment results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.
comment: 10 pages,8 figures
♻ ☆ MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential in static images, its application to video's temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether inherent capability can be unlocked and boost MLLMs' motion perception and enable distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, {\Theta}(40K) video clips and {\Theta}(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models. In particular, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All the code and annotations will be publicly available.
♻ ☆ CurvNet: Latent Contour Representation and Iterative Data Engine for Curvature Angle Estimation
Curvature angle is a quantitative measurement of a curve, in which Cobb angle is customized for spinal curvature. Automatic Cobb angle measurement from X-ray images is crucial for scoliosis screening and diagnosis. However, most existing regression-based and segmentation-based methods struggle with inaccurate spine representations or mask connectivity and fragmentation issues. Besides, landmark-based methods suffer from insufficient training data and annotations. To address these challenges, we propose a novel curvature angle estimation framework named CurvNet including latent contour representation based contour detection and iterative data engine based image self-generation. Specifically, we propose a parameterized spine contour representation in latent space, which enables eigen-spine decomposition and spine contour reconstruction. Latent contour coefficient regression is combined with anchor box classification to solve inaccurate predictions and mask connectivity issues. Moreover, we develop a data engine with image self-generation, automatic annotation, and automatic selection in an iterative manner. By our data engine, we generate a clean dataset named Spinal-AI2024 without privacy leaks, which is the largest released scoliosis X-ray dataset to our knowledge. Extensive experiments on public AASCE2019, our private Spinal2023, and our generated Spinal-AI2024 datasets demonstrate that our method achieves state-of-the-art Cobb angle estimation performance. Our code and Spinal-AI2024 dataset are available at https://github.com/Ernestchenchen/CurvNet and https://github.com/Ernestchenchen/Spinal-AI2024, respectively.
♻ ☆ PointAD+: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection
In this paper, we aim to transfer CLIP's robust 2D generalization capabilities to identify 3D anomalies across unseen objects of highly diverse class semantics. To this end, we propose a unified framework to comprehensively detect and segment 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering pixel representations. This approach is referred to as implicit 3D representation, as it focuses solely on rendering pixel anomalies but neglects the inherent spatial relationships within point clouds. Then, we propose PointAD+ to further broaden the interpretation of 3D anomalies by introducing explicit 3D representation, emphasizing spatial abnormality to uncover abnormal spatial relationships. Hence, we propose G-aggregation to involve geometry information to enable the aggregated point representations spatially aware. To simultaneously capture rendering and spatial abnormality, PointAD+ proposes hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote the interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates anomaly semantics from both layers to capture the generalized anomaly semantics. During the test, PointAD+ can integrate RGB information in a plug-and-play manner and further improve its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in ZS 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.
comment: Submitted to TPAMI
♻ ☆ Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
comment: work in progress
♻ ☆ Revisiting Reweighted Risk for Calibration: AURC, Focal, and Inverse Focal Loss
Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk--Coverage Curve (AURC), have been proposed for improving model calibration, yet their theoretical connections to calibration errors remain unclear. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between calibration error and selective classification. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing selective risk in low-confidence region naturally leads to improved calibration. This loss shares a similar reweighting strategy with dual focal loss but offers greater flexibility through the choice of confidence score functions (CSFs). Our approach uses a bin-based cumulative distribution function (CDF) approximation, enabling efficient gradient-based optimization without requiring expensive sorting and achieving $O(nK)$ complexity. Empirical evaluations demonstrate that our method achieves competitive calibration performance across a range of datasets and model architectures.
♻ ☆ Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models
Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.
♻ ☆ Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation
The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.
♻ ☆ OASIS: Online Sample Selection for Continual Visual Instruction Tuning
In continual instruction tuning (CIT) scenarios, where new instruction tuning data continuously arrive in an online streaming manner, training delays from large-scale data significantly hinder real-time adaptation. Data selection can mitigate this overhead, but existing strategies often rely on pretrained reference models, which are impractical in CIT setups since future data are unknown. Recent reference model-free online sample selection methods address this, but typically select a fixed number of samples per batch (e.g., top-k), making them vulnerable to distribution shifts where informativeness varies across batches. To address these limitations, we propose OASIS, an adaptive online sample selection approach for CIT that (1) selects informative samples by estimating each sample's informativeness relative to all previously seen data, beyond batch-level constraints, and (2) minimizes informative redundancy of selected samples through iterative selection score updates. Experiments on various large foundation models show that OASIS, using only 25 percent of the data, achieves comparable performance to full-data training and outperforms the state-of-the-art sampling methods.
♻ ☆ MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
We tackle the task of any-reference video generation, which aims to synthesize videos conditioned on arbitrary types and combinations of reference subjects, together with textual prompts. This task faces persistent challenges, including identity inconsistency, entanglement among multiple reference subjects, and copy-paste artifacts. To address these issues, we introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism, enabling flexible synthesis conditioned on diverse reference images and textual prompts. Specifically, masked guidance employs a region-aware masking mechanism combined with pixel-wise channel concatenation to preserve appearance features of multiple subjects along the channel dimension. This design preserves identity consistency and maintains the capabilities of the pre-trained backbone, without requiring any architectural changes. To mitigate subject confusion, we introduce a subject disentanglement mechanism which injects the semantic values of each subject derived from the text condition into its corresponding visual region. Additionally, we establish a four-stage data pipeline to construct diverse training pairs, effectively alleviating copy-paste artifacts. Extensive experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches, paving the way for scalable, controllable, and high-fidelity any-reference video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
comment: Code: https://github.com/MAGREF-Video/MAGREF/; Project website: https://magref-video.github.io/
♻ ☆ PainFormer: a Vision Foundation Model for Automatic Pain Assessment
Pain is a manifold condition that impacts a significant percentage of the population. Accurate and reliable pain evaluation for the people suffering is crucial to developing effective and advanced pain management protocols. Automatic pain assessment systems provide continuous monitoring and support decision-making processes, ultimately aiming to alleviate distress and prevent functionality decline. This study introduces PainFormer, a vision foundation model based on multi-task learning principles trained simultaneously on 14 tasks/datasets with a total of 10.9 million samples. Functioning as an embedding extractor for various input modalities, the foundation model provides feature representations to the Embedding-Mixer, a transformer-based module that performs the final pain assessment. Extensive experiments employing behavioral modalities - including RGB, synthetic thermal, and estimated depth videos - and physiological modalities such as ECG, EMG, GSR, and fNIRS revealed that PainFormer effectively extracts high-quality embeddings from diverse input modalities. The proposed framework is evaluated on two pain datasets, BioVid and AI4Pain, and directly compared to 75 different methodologies documented in the literature. Experiments conducted in unimodal and multimodal settings demonstrate state-of-the-art performances across modalities and pave the way toward general-purpose models for automatic pain assessment. The foundation model's architecture (code) and weights are available at: https://github.com/GkikasStefanos/PainFormer.
♻ ☆ MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography
We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the "InDI" concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D's effectiveness, achieving a 12.96 (6.10) dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT (independent real-world) test set and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well.
♻ ☆ I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization
Albeit the scalable performance of vision transformers (ViTs), the dense computational costs (training & inference) undermine their position in industrial applications. Post-training quantization (PTQ), tuning ViTs with a tiny dataset and running in a low-bit format, well addresses the cost issue but unluckily bears more performance drops in lower-bit cases. In this paper, we introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT first identifies two issues in the PTQ of ViTs: (1) Quantization inefficiency in the prevalent log2 quantizer for post-Softmax activations; (2) Rugged and magnified loss landscape in coarse-grained quantization granularity for post-LayerNorm activations. Then, I&S-ViT addresses these issues by introducing: (1) A novel shift-uniform-log2 quantizer (SULQ) that incorporates a shift mechanism followed by uniform quantization to achieve both an inclusive domain representation and accurate distribution approximation; (2) A three-stage smooth optimization strategy (SOS) that amalgamates the strengths of channel-wise and layer-wise quantization to enable stable learning. Comprehensive evaluations across diverse vision tasks validate I&S-ViT' superiority over existing PTQ of ViTs methods, particularly in low-bit scenarios. For instance, I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
comment: This paper has been accepted by TPAMI2025
♻ ☆ CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D ICLR 2026
3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.
comment: 9 pages, 4 figures, submitted for ICLR 2026 conference
♻ ☆ MAMBO: High-Resolution Generative Approach for Mammography Images
Mammography is the gold standard for the detection and diagnosis of breast cancer. This procedure can be significantly enhanced with Artificial Intelligence (AI)-based software, which assists radiologists in identifying abnormalities. However, training AI systems requires large and diverse datasets, which are often difficult to obtain due to privacy and ethical constraints. To address this issue, the paper introduces MAMmography ensemBle mOdel (MAMBO), a novel patch-based diffusion approach designed to generate full-resolution mammograms. Diffusion models have shown breakthrough results in realistic image generation, yet few studies have focused on mammograms, and none have successfully generated high-resolution outputs required to capture fine-grained features of small lesions. To achieve this, MAMBO integrates separate diffusion models to capture both local and global (image-level) contexts. The contextual information is then fed into the final model, significantly aiding the noise removal process. This design enables MAMBO to generate highly realistic mammograms of up to 3840x3840 pixels. Importantly, this approach can be used to enhance the training of classification models and extended to anomaly segmentation. Experiments, both numerical and radiologist validation, assess MAMBO's capabilities in image generation, super-resolution, and anomaly segmentation, highlighting its potential to enhance mammography analysis for more accurate diagnoses and earlier lesion detection. The source code used in this study is publicly available at: https://github.com/iai-rs/mambo.
comment: 21 pages, 14 figures, 7 tables
♻ ☆ TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
Transformer-based architectures have become the backbone of both uni-modal and multi-modal foundation models, largely due to their scalability via attention mechanisms, resulting in a rich ecosystem of publicly available pre-trained models such as LLaVA, CLIP, and DeiT, etc. In parallel, emerging sub-quadratic architectures like Mamba offer promising efficiency gains by enabling global context modeling with linear complexity. However, training these architectures from scratch remains resource-intensive (e.g., in terms of data and time). Motivated by this challenge, we explore a cross-architecture knowledge transfer paradigm, termed TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks. The first stage leverages pre-trained Transformer models to initialize critical components of the Mamba architecture. To bridge architectural and dimensional gaps, we develop a selective weight subcloning strategy and a layered initialization scheme that prioritizes the early $n$ layers. Building on this initialization, the second stage introduces an adaptive multi-directional knowledge distillation method. This mechanism employs layer-wise adaptive scaling factors to align Mamba representations with their Transformer counterparts, while accommodating the scanning order variations inherent to multi-modal Mamba architectures. Despite operating with a reduced training dataset and a more compact model architecture, TransMamba consistently outperforms baseline approaches across diverse mamba-based backbones (e.g., PlainMamba, Vmamba, ViM and VideoMamba) and downstream tasks (e.g., image classification, visual question answering, text-video retrieval and multimodal reasoning). All code and implementation details will be released.
♻ ☆ How We Won BraTS-SSA 2025: Brain Tumor Segmentation in the Sub-Saharan African Population Using Segmentation-Aware Data Augmentation and Model Ensembling MICCAI
Brain tumors, particularly gliomas, pose significant chall-enges due to their complex growth patterns, infiltrative nature, and the variability in brain structure across individuals, which makes accurate diagnosis and monitoring difficult. Deep learning models have been developed to accurately delineate these tumors. However, most of these models were trained on relatively homogenous high-resource datasets, limiting their robustness when deployed in underserved regions. In this study, we performed segmentation-aware offline data augmentation on the BraTS-Africa dataset to increase the data sample size and diversity to enhance generalization. We further constructed an ensemble of three distinct architectures, MedNeXt, SegMamba, and Residual-Encoder U-Net, to leverage their complementary strengths. Our best-performing model, MedNeXt, was trained on 1000 epochs and achieved the highest average lesion-wise dice and normalized surface distance scores of 0.86 and 0.81 respectively. However, the ensemble model trained for 500 epochs produced the most balanced segmentation performance across the tumour subregions. This work demonstrates that a combination of advanced augmentation and model ensembling can improve segmentation accuracy and robustness on diverse and underrepresented datasets. Code available at: https://github.com/SPARK-Academy-2025/SPARK-2025/tree/main/SPARK2025_BraTs_MODELS/SPARK_NeuroAshanti
comment: Brain Tumor Segmentation Challenge, International Medical Image Computing and Computer Assisted Intervention (MICCAI) Conference, 11 Pages, 2 Figures, 2 Tables
♻ ☆ GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement
Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images, ensuring both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations: First, a hierarchical enhancement framework that integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with parametric generation mechanisms that replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Finally, a modified NestUNet architecture incorporating dense block to effectively fuse low-level pixel features and high-level semantic features, specifically adapted for document image characteristics. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, achieving state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.
comment: 12 pages, 7 figures
♻ ☆ Attention based End to end network for Offline Writer Identification on Word level data
Writer identification due to its widespread application in various fields has gained popularity over the years. In scenarios where optimum handwriting samples are available, whether they be in the form of a single line, a sentence, or an entire page, writer identification algorithms have demonstrated noteworthy levels of accuracy. However, in scenarios where only a limited number of handwritten samples are available, particularly in the form of word images, there is a significant scope for improvement. In this paper, we propose a writer identification system based on an attention-driven Convolutional Neural Network (CNN). The system is trained utilizing image segments, known as fragments, extracted from word images, employing a pyramid-based strategy. This methodology enables the system to capture a comprehensive representation of the data, encompassing both fine-grained details and coarse features across various levels of abstraction. These extracted fragments serve as the training data for the convolutional network, enabling it to learn a more robust representation compared to traditional convolution-based networks trained on word images. Additionally, the paper explores the integration of an attention mechanism to enhance the representational power of the learned features. The efficacy of the proposed algorithm is evaluated on three benchmark databases, demonstrating its proficiency in writer identification tasks, particularly in scenarios with limited access to handwriting data.
♻ ☆ UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented, focusing on either text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of QA pairs are validated by multiple annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms: (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval -- under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.
♻ ☆ ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?
Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We create a dataset of 9.9k simulated and real images to bridge the visual sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improve part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems.
comment: 8 pages, 6 figures
♻ ☆ Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy update, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation gets strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address the potential policy drift. Reugularizations such as the clipping of tokens with high covariance between probability and advantage are introduced to the trajectory-level entropy control to curb over-confidence.
comment: 26 pages, 11 figures
♻ ☆ AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
♻ ☆ Surfel-based Gaussian Inverse Rendering for Fast and Relightable Dynamic Human Reconstruction from Monocular Video
Efficient and accurate reconstruction of a relightable, dynamic clothed human avatar from a monocular video is crucial for the entertainment industry. This paper presents SGIA (Surfel-based Gaussian Inverse Avatar), which introduces efficient training and rendering for relightable dynamic human reconstruction. SGIA advances previous Gaussian Avatar methods by comprehensively modeling Physically-Based Rendering (PBR) properties for clothed human avatars, allowing for the manipulation of avatars into novel poses under diverse lighting conditions. Specifically, our approach integrates pre-integration and image-based lighting for fast light calculations that surpass the performance of existing implicit-based techniques. To address challenges related to material lighting disentanglement and accurate geometry reconstruction, we propose an innovative occlusion approximation strategy and a progressive training approach. Extensive experiments demonstrate that SGIA not only achieves highly accurate physical properties but also significantly enhances the realistic relighting of dynamic human avatars, providing a substantial speed advantage. We exhibit more results in our project page: https://GS-IA.github.io.
comment: Accepted by TPAMI 2025; Project Page: https://GS-IA.github.io
♻ ☆ Targetless LiDAR-Camera Calibration with Neural Gaussian Splatting
Accurate LiDAR-camera calibration is crucial for multi-sensor systems. However, traditional methods often rely on physical targets, which are impractical for real-world deployment. Moreover, even carefully calibrated extrinsics can degrade over time due to sensor drift or external disturbances, necessitating periodic recalibration. To address these challenges, we present a Targetless LiDAR-Camera Calibration (TLC-Calib) that jointly optimizes sensor poses with a neural Gaussian-based scene representation. Reliable LiDAR points are frozen as anchor Gaussians to preserve global structure, while auxiliary Gaussians prevent local overfitting under noisy initialization. Our fully differentiable pipeline with photometric and geometric regularization achieves robust and generalizable calibration, consistently outperforming existing targetless methods on KITTI-360, Waymo, and FAST-LIVO2, and surpassing even the provided calibrations in rendering quality.
comment: Project page: https://zang09.github.io/tlc-calib-site
♻ ☆ Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent RL stage. We conclude that a SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.
♻ ☆ MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi-agent collaborative framework designed to assist in long-sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle--Explore, Examine, and Enhanc--to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output -- videos with narratives and background music.
comment: Video Generation Agent
♻ ☆ BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response
Disaster events occur around the world and cause significant damage to human life and property. Earth observation (EO) data enables rapid and comprehensive building damage assessment (BDA), an essential capability in the aftermath of a disaster to reduce human casualties and to inform disaster relief efforts. Recent research focuses on the development of AI models to achieve accurate mapping of unseen disaster events, mostly using optical EO data. However, solutions based on optical data are limited to clear skies and daylight hours, preventing a prompt response to disasters. Integrating multimodal (MM) EO data, particularly the combination of optical and SAR imagery, makes it possible to provide all-weather, day-and-night disaster responses. Despite this potential, the development of robust multimodal AI models has been constrained by the lack of suitable benchmark datasets. In this paper, we present a BDA dataset using veRy-hIGH-resoluTion optical and SAR imagery (BRIGHT) to support AI-based all-weather disaster response. To the best of our knowledge, BRIGHT is the first open-access, globally distributed, event-diverse MM dataset specifically curated to support AI-based disaster response. It covers five types of natural disasters and two types of man-made disasters across 14 regions worldwide, with a particular focus on developing countries where external assistance is most needed. The optical and SAR imagery in BRIGHT, with a spatial resolution between 0.3-1 meters, provides detailed representations of individual buildings, making it ideal for precise BDA. In our experiments, we have tested seven advanced AI models trained with our BRIGHT to validate the transferability and robustness. The dataset and code are available at https://github.com/ChenHongruixuan/BRIGHT. BRIGHT also serves as the official dataset for the 2025 IEEE GRSS Data Fusion Contest.
♻ ☆ DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model
Document dewarping aims to rectify deformations in photographic document images, thus improving text readability, which has attracted much attention and made great progress, but it is still challenging to preserve document structures. Given recent advances in diffusion models, it is natural for us to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control on highly complex document images (e.g., 2000$times$3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we further propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks can not evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks, including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available at https://github.com/hanquansanren/DvD.
♻ ☆ Self-Training with Dynamic Weighting for Robust Gradual Domain Adaptation NIPS 25
In this paper, we propose a new method called Self-Training with Dynamic Weighting (STDW), which aims to enhance robustness in Gradual Domain Adaptation (GDA) by addressing the challenge of smooth knowledge migration from the source to the target domain. Traditional GDA methods mitigate domain shift through intermediate domains and self-training but often suffer from inefficient knowledge migration or incomplete intermediate data. Our approach introduces a dynamic weighting mechanism that adaptively balances the loss contributions of the source and target domains during training. Specifically, we design an optimization framework governed by a time-varying hyperparameter $\varrho$ (progressing from 0 to 1), which controls the strength of domain-specific learning and ensures stable adaptation. The method leverages self-training to generate pseudo-labels and optimizes a weighted objective function for iterative model updates, maintaining robustness across intermediate domains. Experiments on rotated MNIST, color-shifted MNIST, portrait datasets, and the Cover Type dataset demonstrate that STDW outperforms existing baselines. Ablation studies further validate the critical role of $\varrho$'s dynamic scheduling in achieving progressive adaptation, confirming its effectiveness in reducing domain bias and improving generalization. This work provides both theoretical insights and a practical framework for robust gradual domain adaptation, with potential applications in dynamic real-world scenarios. The code is available at https://github.com/Dramwig/STDW.
comment: Accepted by NIPS 25
♻ ☆ Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.
♻ ☆ HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score
Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are not important and can therefore be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLMs efficiency via hierarchical visual token pruning using a novel middle-layer-based importance score. Specifically, we utilize attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention, to estimate visual token importance. Based on this, we propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Specifically, we reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout. In the global retaining stage, we divide the image into regions and retain tokens with higher importance scores in each region; in the local retaining stage, we then divide the image into small windows and retain the most important token in each local window. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improve the token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy, and even achieving improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.
♻ ☆ PRVR: Partially Relevant Video Retrieval
In current text-to-video retrieval (T2VR), videos to be retrieved have been properly trimmed so that a correspondence between the videos and ad-hoc textual queries naturally exists. Note in practice that videos circulated on the Internet and social media platforms, while being relatively short, are typically rich in their content. Often, multiple scenes / actions / events are shown in a single video, leading to a more challenging T2VR setting wherein only part of the video content is relevant w.r.t. a given query. This paper presents a first study on this setting which we term Partially Relevant Video Retrieval (PRVR). Considering that a video typically consists of multiple moments, a video is regarded as partially relevant w.r.t. to a given query if it contains a query-related moment. We formulate the PRVR task as a multiple instance learning problem, and propose a Multi-Scale Similarity Learning (MS-SL++) network that jointly learns both clip-scale and frame-scale similarities to determine the partial relevance between video-query pairs. Extensive experiments on three diverse video-text datasets (TVshow Retrieval, ActivityNet-Captions and Charades-STA) demonstrate the viability of the proposed method.
comment: Accepted by TPAMI. The paper's homepage is https://github.com/HuiGuanLab/ms-sl-pp
♻ ☆ Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations
Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP `forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.
♻ ☆ Product of Experts for Visual Generation
Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
comment: Project page: https://product-of-experts.github.io/
♻ ☆ BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird's-Eye View
3D Single Object Tracking (SOT) is a fundamental task in computer vision and plays a critical role in applications like autonomous driving. However, existing algorithms often involve complex designs and multiple loss functions, making model training and deployment challenging. Furthermore, their reliance on fixed probability distribution assumptions (e.g., Laplacian or Gaussian) hinders their ability to adapt to diverse target characteristics such as varying sizes and motion patterns, ultimately affecting tracking precision and robustness. To address these issues, we propose BEVTrack, a simple yet effective motion-based tracking method. BEVTrack directly estimates object motion in Bird's-Eye View (BEV) using a single regression loss. To enhance accuracy for targets with diverse attributes, it learns adaptive likelihood functions tailored to individual targets, avoiding the limitations of fixed distribution assumptions in previous methods. This approach provides valuable priors for tracking and significantly boosts performance. Comprehensive experiments on KITTI, NuScenes, and Waymo Open Dataset demonstrate that BEVTrack achieves state-of-the-art results while operating at 200 FPS, enabling real-time applicability. The code will be released at https://github.com/xmm-prio/BEVTrack.
comment: IJCAI version
♻ ☆ Efficient Multi Subject Visual Reconstruction from fMRI Using Aligned Representations
This work introduces a novel approach to fMRI-based visual image reconstruction using a subject-agnostic common representation space. We show that the brain signals of the subjects can be aligned in this common space during training to form a semantically aligned common brain. This is leveraged to demonstrate that aligning subject-specific lightweight modules to a reference subject is significantly more efficient than traditional end-to-end training methods. Our approach excels in low-data scenarios. We evaluate our methods on different datasets, demonstrating that the common space is subject and dataset-agnostic.
♻ ☆ Scalable Cosmic AI Inference using Cloud Serverless Computing
Large-scale astronomical image data processing and prediction are essential for astronomers, providing crucial insights into celestial objects, the universe's history, and its evolution. While modern deep learning models offer high predictive accuracy, they often demand substantial computational resources, making them resource-intensive and limiting accessibility. We introduce the Cloud-based Astronomy Inference (CAI) framework to address these challenges. This scalable solution integrates pre-trained foundation models with serverless cloud infrastructure through a Function-as-a-Service (FaaS). CAI enables efficient and scalable inference on astronomical images without extensive hardware. Using a foundation model for redshift prediction as a case study, our extensive experiments cover user devices, HPC (High-Performance Computing) servers, and Cloud. Using redshift prediction with the AstroMAE model demonstrated CAI's scalability and efficiency, achieving inference on a 12.6 GB dataset in only 28 seconds compared to 140.8 seconds on HPC GPUs and 1793 seconds on HPC CPUs. CAI also achieved significantly higher throughput, reaching 18.04 billion bits per second (bps), and maintained near-constant inference times as data sizes increased, all at minimal computational cost (under $5 per experiment). We also process large-scale data up to 1 TB to show CAI's effectiveness at scale. CAI thus provides a highly scalable, accessible, and cost-effective inference solution for the astronomy community. The code is accessible at https://github.com/UVA-MLSys/AI-for-Astronomy.
♻ ☆ Controllable Hybrid Captioner for Improved Long-form Video Understanding
Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with a LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signals scene changes detected in the video.
♻ ☆ MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi-agent collaborative framework designed to assist in long-sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle -- Explore, Examine, and Enhance -- to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output -- videos with narratives and background music.
comment: Video Generation Agent
♻ ☆ High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization WACV 2026
Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
comment: WACV 2026: Winter Conference on Applications of Computer Vision 2026
♻ ☆ Making Bias Amplification in Balanced Datasets Directional and Interpretable
Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute A (e.g., gender) and a task T (e.g., cooking). However, these metrics fail to measure biases when A is balanced with T. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.
♻ ☆ Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpassing fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.
♻ ☆ Measuring directional bias amplification in image captions using predictability
When we train models on biased ML datasets, they not only learn these biases but can inflate them at test time - a phenomenon called bias amplification. To measure bias amplification in ML datasets, many co-occurrence-based metrics have been proposed. Co-occurrence-based metrics are effective in measuring bias amplification in simple problems like image classification. However, these metrics are ineffective for complex problems like image captioning as they cannot capture the semantics of a caption. To measure bias amplification in captions, prior work introduced a predictability-based metric called Leakage in Captioning (LIC). While LIC captures the semantics and context of captions, it has limitations. LIC cannot identify the direction in which bias is amplified, poorly estimates dataset bias due to a weak vocabulary substitution strategy, and is highly sensitive to attacker models (a hyperparameter in predictability-based metrics). To overcome these issues, we propose Directional Predictability Amplification in Captioning (DPAC). DPAC measures directional bias amplification in captions, provides a better estimate of dataset bias using an improved substitution strategy, and is less sensitive to attacker models. Our experiments on the COCO captioning dataset show how DPAC is the most reliable metric to measure bias amplification in captions.
♻ ☆ Interpretable and Granular Video-Based Quantification of Motor Characteristics from the Finger Tapping Test in Parkinson Disease
Accurately quantifying motor characteristics in Parkinson disease (PD) is crucial for monitoring disease progression and optimizing treatment strategies. The finger-tapping test is a standard motor assessment. Clinicians visually evaluate a patient's tapping performance and assign an overall severity score based on tapping amplitude, speed, and irregularity. However, this subjective evaluation is prone to inter- and intra-rater variability, and does not offer insights into individual motor characteristics captured during this test. This paper introduces a granular computer vision-based method for quantifying PD motor characteristics from video recordings. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. We evaluate our approach on video recordings and clinical evaluations of 74 PD patients from the Personalized Parkinson Project. Principal component analysis with varimax rotation shows that the video-based features corresponded to the four deficits. Additionally, video-based analysis has allowed us to identify further granular distinctions within sequence effect and hesitation-halts deficits. In the following, we have used these features to train machine learning classifiers to estimate the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score. Compared to state-of-the-art approaches, our method achieves a higher accuracy in MDS-UPDRS score prediction, while still providing an interpretable quantification of individual finger-tapping motor characteristics. In summary, the proposed framework provides a practical solution for the objective assessment of PD motor characteristics, that can potentially be applied in both clinical and remote settings. Future work is needed to assess its responsiveness to symptomatic treatment and disease progression.
♻ ☆ A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis
Self-supervised learning is a machine learning approach that generates implicit labels by learning underlined patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of "positive" and "negative" samples, where positive pairs (e.g., variation of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther away. This methodology has shown significant improvements in image understanding and image text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments and applications of contrastive learning with respect to text-image models. Specifically, we provide an overview of the approaches of contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we further introduce and discuss the latest advances of the techniques used in the process such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss the recent state-of-art applications of self-supervised contrastive learning Text-Image based models.
comment: 38 pages, 8 figures, survey paper
♻ ☆ Through the Perspective of LiDAR: A Feature-Enriched and Uncertainty-Aware Annotation Pipeline for Terrestrial Point Cloud Segmentation
Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi-automated, uncertainty-aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi-source features, and trains an ensemble of segmentation networks to produce pseudo-labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back-projected to 3D, yielding densely annotated point clouds supported by a three-tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine-channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature-enrichment strategy through cross-dataset tests on ForestSemantic and Semantic3D. Our contributions include: (i) a robust, uncertainty-aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high-quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at https://fz-rit.github.io/through-the-lidars-eye/.
comment: 40 pages (28 main text), 20 figures, 4 supplementary materials; links to 3D point animations are included in the last table
♻ ☆ HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous settings, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, socially responsible navigation research.
comment: 33 pages, 20 figures, website: https://ha-vln-project.vercel.app/
♻ ☆ Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration ICCV
Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we provide the first rigorous analysis of this approximation error for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. Motivated by our theoretical insights, we propose a simple modification to existing diffusion-based restoration methods. Our approach introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We develop an adaptive curriculum for this frequency schedule based on the underlying data distribution. Our method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
comment: Accepted at International Conference on Computer Vision (ICCV) 2025
Machine Learning 322
☆ BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation
Scaling data and models has played a pivotal role in the remarkable progress of computer vision and language. Inspired by these domains, recent efforts in robotics have similarly focused on scaling both data and model size to develop more generalizable and robust policies. However, unlike vision and language, robotics lacks access to internet-scale demonstrations across diverse robotic tasks and environments. As a result, the scale of existing datasets typically suffers from the need for manual data collection and curation. To address this problem, here we propose BLAZER, a framework that learns manipulation policies from automatically generated training data. We build on the zero-shot capabilities of LLM planners and automatically generate demonstrations for diverse manipulation tasks in simulation. Successful examples are then used to finetune an LLM and to improve its planning capabilities without human supervision. Notably, while BLAZER training requires access to the simulator's state, we demonstrate direct transfer of acquired skills to sensor-based manipulation. Through extensive experiments, we show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Moreover, BLAZER improves on tasks outside of its training pool and enables downscaling of LLM models. Our code and data will be made publicly available on the project page.
comment: 11 pages, 8 figures
☆ Reconstructing the local density field with combined convolutional and point cloud architecture NeurIPS 2025
We construct a neural network to perform regression on the local dark-matter density field given line-of-sight peculiar velocities of dark-matter halos, biased tracers of the dark matter field. Our architecture combines a convolutional U-Net with a point-cloud DeepSets. This combination enables efficient use of small-scale information and improves reconstruction quality relative to a U-Net-only approach. Specifically, our hybrid network recovers both clustering amplitudes and phases better than the U-Net on small scales.
comment: 6 pages, 4 figures, 1 table. Accepted at the NeurIPS 2025 Workshop: ML4PS. Comments welcome!
☆ Who Said Neural Networks Aren't Linear?
Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f$$:$$X$$\to$$Y$. Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is, in fact, linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if we sandwich a linear operator $A$ between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, then the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling actions derived from $g_x$ and $g_y$. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e. $f(f(x))=f(x)$) on networks leading to a globally projective generative model and to demonstrate modular style transfer.
☆ ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
comment: Preprint
☆ How to Teach Large Multimodal Models New Skills
How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
comment: In submission. Code is available at https://github.com/jessemelpolio/LMM_CL
☆ Where Have All the Kaczmarz Iterates Gone?
The randomized Kaczmarz (RK) algorithm is one of the most computationally and memory-efficient iterative algorithms for solving large-scale linear systems. However, practical applications often involve noisy and potentially inconsistent systems. While the convergence of RK is well understood for consistent systems, the study of RK on noisy, inconsistent linear systems is limited. This paper investigates the asymptotic behavior of RK iterates in expectation when solving noisy and inconsistent systems, addressing the locations of their limit points. We explore the roles of singular vectors of the (noisy) coefficient matrix and derive bounds on the convergence horizon, which depend on the noise levels and system characteristics. Finally, we provide extensive numerical experiments that validate our theoretical findings, offering practical insights into the algorithm's performance under realistic conditions. These results establish a deeper understanding of the RK algorithm's limitations and robustness in noisy environments, paving the way for optimized applications in real-world scientific and engineering problems.
☆ Agent Learning via Early Experience
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
comment: Work in progress
☆ Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce \textbf{Group Diffusion Policy Optimization (GDPO)}, a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
☆ Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints
We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models(LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
☆ SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode phase. To efficiently serve LLMs, prior work proposes prefill-decode disaggregation to run each phase on separate hardware. However, existing hardware poorly matches the different requirements of each phase. Current datacenter GPUs and TPUs follow a more-is-better design philosophy that maximizes compute and memory resources, causing memory bandwidth underutilization in the prefill phase and compute underutilization in the decode phase. Such underutilization directly translates into increased serving costs. This paper proposes SPAD (Specialized Prefill and Decode hardware), adopting a less-is-more methodology to design specialized chips tailored to the distinct characteristics of prefill and decode phases. The proposed Prefill Chips have larger systolic arrays and use cost-effective GDDR memory, whereas the proposed Decode Chips retain high memory bandwidth but reduce compute capacity. Compared to modeled H100s, simulations show that the proposed Prefill Chips deliver 8% higher prefill performance on average at 52% lower hardware cost, while the proposed Decode Chips achieve 97% of the decode performance with 28% lower TDP. End-to-end simulations on production traces show that SPAD reduces hardware cost by 19%-41% and TDP by 2%-17% compared to modeled baseline clusters while offering the same performance. Even when models and workloads change, SPAD can reallocate either type of chip to run either phase and still achieve 11%-43% lower hardware costs, demonstrating the longevity of the SPAD design.
☆ Computational and statistical lower bounds for low-rank estimation under general inhomogeneous noise
Recent work has generalized several results concerning the well-understood spiked Wigner matrix model of a low-rank signal matrix corrupted by additive i.i.d. Gaussian noise to the inhomogeneous case, where the noise has a variance profile. In particular, for the special case where the variance profile has a block structure, a series of results identified an effective spectral algorithm for detecting and estimating the signal, identified the threshold signal strength required for that algorithm to succeed, and proved information-theoretic lower bounds that, for some special signal distributions, match the above threshold. We complement these results by studying the computational optimality of this spectral algorithm. Namely, we show that, for a much broader range of signal distributions, whenever the spectral algorithm cannot detect a low-rank signal, then neither can any low-degree polynomial algorithm. This gives the first evidence for a computational hardness conjecture of Guionnet, Ko, Krzakala, and Zdeborov\'a (2023). With similar techniques, we also prove sharp information-theoretic lower bounds for a class of signal distributions not treated by prior work. Unlike all of the above results on inhomogeneous models, our results do not assume that the variance profile has a block structure, and suggest that the same spectral algorithm might remain optimal for quite general profiles. We include a numerical study of this claim for an example of a smoothly-varying rather than piecewise-constant profile. Our proofs involve analyzing the graph sums of a matrix, which also appear in free and traffic probability, but we require new bounds on these quantities that are tighter than existing ones for non-negative matrices, which may be of independent interest.
comment: 52 pages, 3 figures
☆ On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
☆ Permutation-Invariant Spectral Learning via Dyson Diffusion
Diffusion models are central to generative modeling and have been adapted to graphs by diffusing adjacency matrix representations. The challenge of having up to $n!$ such representations for graphs with $n$ nodes is only partially mitigated by using permutation-equivariant learning architectures. Despite their computational efficiency, existing graph diffusion models struggle to distinguish certain graph families, unless graph data are augmented with ad hoc features. This shortcoming stems from enforcing the inductive bias within the learning architecture. In this work, we leverage random matrix theory to analytically extract the spectral properties of the diffusion process, allowing us to push the inductive bias from the architecture into the dynamics. Building on this, we introduce the Dyson Diffusion Model, which employs Dyson's Brownian Motion to capture the spectral dynamics of an Ornstein-Uhlenbeck process on the adjacency matrix while retaining all non-spectral information. We demonstrate that the Dyson Diffusion Model learns graph spectra accurately and outperforms existing graph diffusion models.
☆ Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning NeurIPS 2025
In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy derived objects--value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated to its interpretable, diversity-preserving optimal policy.
comment: Accepted to NeurIPS 2025. First two authors contributed equally
☆ DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
Existing batch size selection approaches in dis- tributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequen- tial decision-making problem using Proximal Policy Optimiza- tion (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse work- loads, hardware configurations, and network conditions, DY- NAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.
☆ CaRT: Teaching LLM Agents to Know When They Know Enough
Many tasks require learned models to strategically gather relevant information over multiple rounds of interaction before actually acting on a task. Strategic information gathering requires models to know not only how to effectively acquire information, but also when to stop gathering information and make a decision, in order to avoid overthinking or getting derailed when acting. In this paper, we formalize this problem and introduce Counterfactuals and Reasoning for Termination (CaRT), an approach for teaching LLMs when to stop seeking information. To appropriately learn when to terminate, CaRT fine-tunes LLMs using counterfactual pairs of trajectories, one where termination is appropriate and a minimally modified version of the same trajectory where it is not. It trains the LLM to explain the rationale for the termination decision in either case via verbal reasoning, and imbues this capability into the base LLM via fine-tuning. We instantiate CaRT in two domains: interactive medical diagnosis and math problem solving. In both domains, we find that CaRT improves the efficiency of information gathering and task success rate compared to other fine-tuning methods.
☆ AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents
Large language models (LLMs) have shown impressive performance in general programming tasks. However, in Machine Learning Engineering (MLE) scenarios such as AutoML and Kaggle competitions, achieving high performance depends heavily on expert intervention and repeated adjustments rather than simply generating correct code. When applied directly to these tasks, LLMs often lack fine-grained domain priors, and existing MLE approaches that use linear or tree-structured searches limit knowledge transfer to adjacent hierarchical links. As a result, they cannot leverage past full trajectories or share information across branches, limiting self-evolving ability and search space diversity. To address these limitations, we introduce AutoMLGen, an LLM-based coding agent that integrates a domain knowledge base for high-quality prior guidance and Monte Carlo Graph Search (MCGS) for efficient exploration. MCGS retains the tree-guided exploration of MCTS while embedding a graph structure into the expansion stage to enable dynamic path reorganization, historical trajectory reuse, and multi-solution fusion to support both self-evolution and collaborative learning. Combined with fine-grained operator sets, this design improves stability and accelerates convergence. Evaluation on the MLE-Bench shows that AutoMLGen achieves state-of-the-art performance in numerous dimensions, such as the average medal rate and the valid submission rate, under a 12-hour budget (half the standard runtime). The code is available at https://github.com/Alpha-Innovator/InternAgent.
☆ AI-Driven Radiology Report Generation for Traumatic Brain Injuries
Traumatic brain injuries present significant diagnostic challenges in emergency medicine, where the timely interpretation of medical images is crucial for patient outcomes. In this paper, we propose a novel AI-based approach for automatic radiology report generation tailored to cranial trauma cases. Our model integrates an AC-BiFPN with a Transformer architecture to capture and process complex medical imaging data such as CT and MRI scans. The AC-BiFPN extracts multi-scale features, enabling the detection of intricate anomalies like intracranial hemorrhages, while the Transformer generates coherent, contextually relevant diagnostic reports by modeling long-range dependencies. We evaluate the performance of our model on the RSNA Intracranial Hemorrhage Detection dataset, where it outperforms traditional CNN-based models in both diagnostic accuracy and report generation. This solution not only supports radiologists in high-pressure environments but also provides a powerful educational tool for trainee physicians, offering real-time feedback and enhancing their learning experience. Our findings demonstrate the potential of combining advanced feature extraction with transformer-based text generation to improve clinical decision-making in the diagnosis of traumatic brain injuries.
☆ Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/
comment: 63 pages, 29 tables, and 47 figures
☆ Implementing Semantic Join Operators Efficiently
Semantic query processing engines often support semantic joins, enabling users to match rows that satisfy conditions specified in natural language. Such join conditions can be evaluated using large language models (LLMs) that solve novel tasks without task-specific training. Currently, many semantic query processing engines implement semantic joins via nested loops, invoking the LLM to evaluate the join condition on row pairs. Instead, this paper proposes a novel algorithm, inspired by the block nested loops join operator implementation in traditional database systems. The proposed algorithm integrates batches of rows from both input tables into a single prompt. The goal of the LLM invocation is to identify all matching row pairs in the current input. The paper introduces formulas that can be used to optimize the size of the row batches, taking into account constraints on the size of the LLM context window (limiting both input and output size). An adaptive variant of the proposed algorithm refers to cases in which the size of the output is difficult to estimate. A formal analysis of asymptotic processing costs, as well as empirical results, demonstrates that the proposed approach reduces costs significantly and performs well compared to join implementations used by recent semantic query processing engines.
☆ DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos
We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos, without the need for manual data collection and costly motion capture, and enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation.
comment: Video results are available at: https://embodiedai-ntu.github.io/dexman/index.html
☆ Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
comment: Accepted to the EMNLP 2025 BabyLM Workshop
☆ In-Context Clustering with Large Language Models
We propose In-Context Clustering (ICC), a flexible LLM-based procedure for clustering data from diverse distributions. Unlike traditional clustering algorithms constrained by predefined similarity measures, ICC flexibly captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data, with attention matrices showing salient cluster patterns. Spectral clustering using attention matrices offers surprisingly competitive performance. We further enhance the clustering capabilities of LLMs on numeric and image data through fine-tuning using the Next Token Prediction (NTP) loss. Moreover, the flexibility of LLM prompting enables text-conditioned image clustering, a capability that classical clustering methods lack. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering. Our code is available at https://agenticlearning.ai/icc.
☆ Accelerated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models
Recent advances in supervised learning have driven growing interest in explaining black-box models, particularly by estimating the effects of input variables on model predictions. However, existing approaches often face key limitations, including poor scalability, sensitivity to out-of-distribution sampling, and instability under correlated features. To address these issues, we propose A2D2E, an $\textbf{E}$stimator based on $\textbf{A}$ccelerated $\textbf{A}$ggregated $\textbf{D}$-Optimal $\textbf{D}$esigns. Our method leverages principled experimental design to improve efficiency and robustness in main effect estimation. We establish theoretical guarantees, including convergence and variance reduction, and validate A2D2E through extensive simulations. We further provide the potential of the proposed method with a case study on real data and applications in language models. The code to reproduce the results can be found at https://github.com/cchihyu/A2D2E.
☆ Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered
Vision-Language-Action (VLA) models have advanced robotic capabilities but remain challenging to deploy on resource-limited hardware. Pruning has enabled efficient compression of large language models (LLMs), yet it is largely understudied in robotics. Surprisingly, we observe that pruning VLA models leads to drastic degradation and increased safety violations. We introduce GLUESTICK, a post-pruning recovery method that restores much of the original model's functionality while retaining sparsity benefits. Our method performs a one-time interpolation between the dense and pruned models in weight-space to compute a corrective term. This correction is used during inference by each pruned layer to recover lost capabilities with minimal overhead. GLUESTICK requires no additional training, is agnostic to the pruning algorithm, and introduces a single hyperparameter that controls the tradeoff between efficiency and accuracy. Across diverse VLA architectures and tasks in manipulation and navigation, GLUESTICK achieves competitive memory efficiency while substantially recovering success rates and reducing safety violations. Additional material can be found at: https://gluestick-vla.github.io/.
☆ Wavefunction Flows: Efficient Quantum Simulation of Continuous Flow Models
Flow models are a cornerstone of modern machine learning. They are generative models that progressively transform probability distributions according to learned dynamics. Specifically, they learn a continuous-time Markov process that efficiently maps samples from a simple source distribution into samples from a complex target distribution. We show that these models are naturally related to the Schr\"odinger equation, for an unusual Hamiltonian on continuous variables. Moreover, we prove that the dynamics generated by this Hamiltonian can be efficiently simulated on a quantum computer. Together, these results give a quantum algorithm for preparing coherent encodings (a.k.a., qsamples) for a vast family of probability distributions--namely, those expressible by flow models--by reducing the task to an existing classical learning problem, plus Hamiltonian simulation. For statistical problems defined by flow models, such as mean estimation and property testing, this enables the use of quantum algorithms tailored to qsamples, which may offer advantages over classical algorithms based only on samples from a flow model. More broadly, these results reveal a close connection between state-of-the-art machine learning models, such as flow matching and diffusion models, and one of the main expected capabilities of quantum computers: simulating quantum dynamics.
☆ SummDiff: Generative Modeling of Video Summarization with Diffusion
Video summarization is a task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to an averaged frame score over multiple raters, ignoring the inherent subjectivity of what constitutes a good summary. We propose a novel problem formulation by framing video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves the state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide a deeper insight with novel metrics from an analysis of the knapsack, which is an important last step of generating summaries but has been overlooked in evaluation.
☆ Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning
Activation functions govern the expressivity and stability of neural networks, yet existing comparisons remain largely heuristic. We propose a rigorous framework for their classification via a nine-dimensional integral signature S_sigma(phi), combining Gaussian propagation statistics (m1, g1, g2, m2, eta), asymptotic slopes (alpha_plus, alpha_minus), and regularity measures (TV(phi'), C(phi)). This taxonomy establishes well-posedness, affine reparameterization laws with bias, and closure under bounded slope variation. Dynamical analysis yields Lyapunov theorems with explicit descent constants and identifies variance stability regions through (m2', g2). From a kernel perspective, we derive dimension-free Hessian bounds and connect smoothness to bounded variation of phi'. Applying the framework, we classify eight standard activations (ReLU, leaky-ReLU, tanh, sigmoid, Swish, GELU, Mish, TeLU), proving sharp distinctions between saturating, linear-growth, and smooth families. Numerical Gauss-Hermite and Monte Carlo validation confirms theoretical predictions. Our framework provides principled design guidance, moving activation choice from trial-and-error to provable stability and kernel conditioning.
comment: 25 pages
☆ gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity
Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed sized vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node's representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our capacity synthetic task, as well as a range of real-world graph benchmarks.
comment: 22 pages, 22 figures, 7 tables
☆ Synthetic Series-Symbol Data Generation for Time Series Foundation Models NeurIPS 2025
Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop \texttt{SymTime}, a pre-trained foundation model for enhancing time series representation using symbolic information. \texttt{SymTime} demonstrates competitive performance across five major TSA tasks when fine-tunes with downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
comment: 63 pages, NeurIPS 2025 accepted
☆ xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning
Modern LLM deployments confront a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. Static escalation rules and keyword heuristics under-utilize this spectrum and fail to adapt across task types. We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. The router is trained end-to-end with reinforcement learning using an explicit, cost-aware reward that encodes cost-performance trade-offs, eliminating the need for hand-engineered routing rules. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting, as well as the deployment and evaluation pipelines. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs (e.g., substantial cost reductions at comparable task completion rates), and provides empirical insights into what reliably helps learned routing and what does not, ranging from model trainability to the difficulty of eliciting sophisticated orchestration behaviors in small open models. We hope these findings and our open implementation will serve as a practical substrate for advancing learned, cost-aware LLM orchestration.
comment: 24 Pages, 4 Figures, 2 Tables
☆ Navigating Sparsities in High-Dimensional Linear Contextual Bandits
High-dimensional linear contextual bandit problems remain a significant challenge due to the curse of dimensionality. Existing methods typically consider either the model parameters to be sparse or the eigenvalues of context covariance matrices to be (approximately) sparse, lacking general applicability due to the rigidity of conventional reward estimators. To overcome this limitation, a powerful pointwise estimator is introduced in this work that adaptively navigates both kinds of sparsity. Based on this pointwise estimator, a novel algorithm, termed HOPE, is proposed. Theoretical analyses demonstrate that HOPE not only achieves improved regret bounds in previously discussed homogeneous settings (i.e., considering only one type of sparsity) but also, for the first time, efficiently handles two new challenging heterogeneous settings (i.e., considering a mixture of two types of sparsity), highlighting its flexibility and generality. Experiments corroborate the superiority of HOPE over existing methods across various scenarios.
☆ Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
☆ ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing
Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause-grounded reinforcement learning framework that produces transparent, regulation-compliant, and risk-aware treaty quotes. ClauseLens models the quoting task as a Risk-Aware Constrained Markov Decision Process (RA-CMDP). Statutory and policy clauses are retrieved from legal and underwriting corpora, embedded into the agent's observations, and used both to constrain feasible actions and to generate clause-grounded natural language justifications. Evaluated in a multi-agent treaty simulator calibrated to industry data, ClauseLens reduces solvency violations by 51%, improves tail-risk performance by 27.9% (CVaR_0.10), and achieves 88.2% accuracy in clause-grounded explanations with retrieval precision of 87.4% and recall of 91.1%. These findings demonstrate that embedding legal context into both decision and explanation pathways yields interpretable, auditable, and regulation-aligned quoting behavior consistent with Solvency II, NAIC RBC, and the EU AI Act.
comment: Accepted for publication at the 6th ACM International Conference on AI in Finance (ICAIF 2025), Singapore. Author-accepted version (October 2025). 10 pages, 5 figures
☆ Reinforcing Diffusion Models by Direct Group Preference Optimization
While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
☆ Prompts Generalize with Low Data: Non-vacuous Generalization Bounds for Optimizing Prompts with More Informative Priors ICML 2025
Many prompt engineering techniques have been successful in practice, even when optimizing over a large prompt space with with a small amount of task-specific data. Recent work has partially explained this success by showing generalization bounds which apply PAC-Bayes theory to the discrete prompt space, but they are non-vacuous only in data-rich scenarios. We argue that such widespread success can be more fully explained through more carefully considering data- or distribution-dependent perplexity, which acts as an effective prior and steers the optimization towards prompts that are more ``natural'' for the task at hand. We derive novel generalization bounds that are non-vacuous for data-scarce prompt optimization via more useful priors, formally analyzing how perplexity regularization tightens these bounds by limiting exploration. Empirically, we explore both the bounds' effectiveness and the practical benefits of perplexity regularization in improving prompt generalization.
comment: EXAIT Workshop paper at ICML 2025
☆ Optimal Stopping in Latent Diffusion Models
We identify and analyze a surprising phenomenon of Latent Diffusion Models (LDMs) where the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lower-dimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping time. We further establish that the latent dimension interplays with other hyperparameters of the problem such as constraints in the parameters of score matching. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences the sample quality, and highlight stopping time as a key hyperparameter in LDMs.
☆ Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin
The mechanosensory system of teeth is currently believed to partly rely on Odontoblast cells stimulation by fluid flow through a porosity network extending through dentin. Visualizing the smallest sub-microscopic porosity vessels therefore requires the highest achievable resolution from confocal fluorescence microscopy, the current gold standard. This considerably limits the extent of the field of view to very small sample regions. To overcome this limitation, we tested different deep learning (DL) super-resolution (SR) models to allow faster experimental acquisitions of lower resolution images and restore optimal image quality by post-processing. Three supervised 2D SR models (RCAN, pix2pix, FSRCNN) and one unsupervised (CycleGAN) were applied to a unique set of experimentally paired high- and low-resolution confocal images acquired with different sampling schemes, resulting in a pixel size increase of x2, x4, x8. Model performance was quantified using a broad set of similarity and distribution-based image quality assessment (IQA) metrics, which yielded inconsistent results that mostly contradicted our visual perception. This raises the question of the relevance of such generic metrics to efficiently target the specific structure of dental porosity. To resolve this conflicting information, the generated SR images were segmented taking into account the specific scales and morphology of the porosity network and analysed by comparing connected components. Additionally, the capacity of the SR models to preserve 3D porosity connectivity throughout the confocal image stacks was evaluated using graph analysis. This biology-driven assessment allowed a far better mechanistic interpretation of SR performance, highlighting differences in model sensitivity to weak intensity features and the impact of non-linearity in image generation, which explains the failure of standard IQA metrics.
☆ Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT
We show that a tiny Co$^4$ machine(Adeel,2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2))$ and GPT-BERT (30M, 12 layers, $O(N^2))$ in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.
☆ FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts NeurIPS 2025
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
comment: NeurIPS 2025 accepted paper
☆ Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions
In this paper we will give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension that is based off of the Natarajan Dimension \citep{natarajan1989learning} and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.
comment: 9 pages
☆ Contrastive Self-Supervised Learning at the Edge: An Energy Perspective
While contrastive learning (CL) shows considerable promise in self-supervised representation learning, its deployment on resource-constrained devices remains largely underexplored. The substantial computational demands required for training conventional CL frameworks pose a set of challenges, particularly in terms of energy consumption, data availability, and memory usage. We conduct an evaluation of four widely used CL frameworks: SimCLR, MoCo, SimSiam, and Barlow Twins. We focus on the practical feasibility of these CL frameworks for edge and fog deployment, and introduce a systematic benchmarking strategy that includes energy profiling and reduced training data conditions. Our findings reveal that SimCLR, contrary to its perceived computational cost, demonstrates the lowest energy consumption across various data regimes. Finally, we also extend our analysis by evaluating lightweight neural architectures when paired with CL frameworks. Our study aims to provide insights into the resource implications of deploying CL in edge/fog environments with limited processing capabilities and opens several research directions for its future optimization.
☆ On the Relationship Between the Choice of Representation and In-Context Learning
In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.
comment: 25 pages, 6 figures, 10 tables
Guided Star-Shaped Masked Diffusion
The performance of pre-trained masked diffusion models is often constrained by their sampling procedure, which makes decisions irreversible and struggles in low-step generation regimes. We introduce a novel sampling algorithm that works with pre-trained models and, after a lightweight fine-tuning of a single layer, significantly improves sample quality and efficiency. Our method reformulates the generation process using a star-shaped paradigm, which inherently allows for error correction. To make this process effective, we augment it with a learnable re-masking scheduler that intelligently identifies and revises likely errors. This approach yields a substantial quality boost, particularly when using a small number of sampling steps. We extensively ablate key components of our approach and show its usability in different scenarios. In comprehensive experiments on text, and code generation, our sampling algorithm outperforms or matches existing methods.
☆ DeepEN: Personalized Enteral Nutrition for Critically Ill Patients using Deep Reinforcement Learning
We introduce DeepEN, a deep reinforcement learning (RL) framework for personalized enteral nutrition (EN) in critically ill patients. Trained offline on over 11,000 ICU patients from the MIMIC-IV database, DeepEN generates 4-hourly recommendations for caloric, protein, and fluid intake tailored to each patient's evolving physiology. The model integrates a curated, clinically informed state space with a custom reward function that balances short-term physiological and nutrition-related goals with long-term survival outcomes. Using a dueling double deep Q-network with conservative Q-learning regularization, DeepEN learns clinically realistic policies that align with high-value clinician actions while discouraging unsafe deviations. Across various qualitative and quantitative metrics, DeepEN outperforms clinician-derived and guideline-based policies, achieving a 3.7 $\pm$ 0.17 percentage-point reduction in estimated mortality (18.8% vs 22.5%) and improvements in key nutritional biomarkers. These findings highlight the potential of safe, data-driven personalization of EN therapy to improve outcomes beyond traditional guideline- or heuristic-based approaches.
☆ Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization
We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence -- an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
comment: 10 pages, 5 figures, 2 tables
☆ PAC Learnability in the Presence of Performativity
Following the wide-spread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely based on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are learnable, via the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function, which depends only on data from the original distribution and on the type performative effect, and is yet an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this notion of performative risk allows us to show that any PAC-learnable hypothesis space in the standard binary classification setting remains PAC-learnable for the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase benefits on synthetic and real data.
comment: 21 pages, 3 figures
☆ New Machine Learning Approaches for Intrusion Detection in ADS-B
With the growing reliance on the vulnerable Automatic Dependent Surveillance-Broadcast (ADS-B) protocol in air traffic management (ATM), ensuring security is critical. This study investigates emerging machine learning models and training strategies to improve AI-based intrusion detection systems (IDS) for ADS-B. Focusing on ground-based ATM systems, we evaluate two deep learning IDS implementations: one using a transformer encoder and the other an extended Long Short-Term Memory (xLSTM) network, marking the first xLSTM-based IDS for ADS-B. A transfer learning strategy was employed, involving pre-training on benign ADS-B messages and fine-tuning with labeled data containing instances of tampered messages. Results show this approach outperforms existing methods, particularly in identifying subtle attacks that progressively undermine situational awareness. The xLSTM-based IDS achieves an F1-score of 98.9%, surpassing the transformer-based model at 94.3%. Tests on unseen attacks validated the generalization ability of the xLSTM model. Inference latency analysis shows that the 7.26-second delay introduced by the xLSTM-based IDS fits within the Secondary Surveillance Radar (SSR) refresh interval (5-12 s), although it may be restrictive for time-critical operations. While the transformer-based IDS achieves a 2.1-second latency, it does so at the cost of lower detection performance.
comment: This is the author's version of the work accepted for publication Digital Avionics Systems Conference (DASC) 2025. The final version will be available via IEEE Xplore
☆ Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
comment: 10 pages, 3 figures
☆ Iterated Agent for Symbolic Regression
Symbolic regression (SR), the automated discovery of mathematical expressions from data, is a cornerstone of scientific inquiry. However, it is often hindered by the combinatorial explosion of the search space and a tendency to overfit. Popular methods, rooted in genetic programming, explore this space syntactically, often yielding overly complex, uninterpretable models. This paper introduces IdeaSearchFitter, a framework that employs Large Language Models (LLMs) as semantic operators within an evolutionary search. By generating candidate expressions guided by natural-language rationales, our method biases discovery towards models that are not only accurate but also conceptually coherent and interpretable. We demonstrate IdeaSearchFitter's efficacy across diverse challenges: it achieves competitive, noise-robust performance on the Feynman Symbolic Regression Database (FSReD), outperforming several strong baselines; discovers mechanistically aligned models with good accuracy-complexity trade-offs on real-world data; and derives compact, physically-motivated parametrizations for Parton Distribution Functions in a frontier high-energy physics application. IdeaSearchFitter is a specialized module within our broader iterated agent framework, IdeaSearch, which is publicly available at https://www.ideasearch.cn/.
comment: 45 pages, 22 figures, 8 tables
☆ To Ask or Not to Ask: Learning to Require Human Feedback
Developing decision-support systems that complement human performance in classification tasks remains an open challenge. A popular approach, Learning to Defer (LtD), allows a Machine Learning (ML) model to pass difficult cases to a human expert. However, LtD treats humans and ML models as mutually exclusive decision-makers, restricting the expert contribution to mere predictions. To address this limitation, we propose Learning to Ask (LtA), a new framework that handles both when and how to incorporate expert input in an ML model. LtA is based on a two-part architecture: a standard ML model and an enriched model trained with additional expert human feedback, with a formally optimal strategy for selecting when to query the enriched model. We provide two practical implementations of LtA: a sequential approach, which trains the models in stages, and a joint approach, which optimises them simultaneously. For the latter, we design surrogate losses with realisable-consistency guarantees. Our experiments with synthetic and real expert data demonstrate that LtA provides a more flexible and powerful foundation for effective human-AI collaboration.
☆ Robust and Efficient Collaborative Learning
Collaborative machine learning is challenged by training-time adversarial behaviors. Existing approaches to tolerate such behaviors either rely on a central server or induce high communication costs. We propose Robust Pull-based Epidemic Learning (RPEL), a novel, scalable collaborative approach to ensure robust learning despite adversaries. RPEL does not rely on any central server and, unlike traditional methods, where communication costs grow in $\mathcal{O}(n^2)$ with the number of nodes $n$, RPEL employs a pull-based epidemic-based communication strategy that scales in $\mathcal{O}(n \log n)$. By pulling model parameters from small random subsets of nodes, RPEL significantly lowers the number of required messages without compromising convergence guarantees, which hold with high probability. Empirical results demonstrate that RPEL maintains robustness in adversarial settings, competes with all-to-all communication accuracy, and scales efficiently across large networks.
☆ Dynamic Features Adaptation in Networking: Toward Flexible training and Explainable inference NeurIPS 2025
As AI becomes a native component of 6G network control, AI models must adapt to continuously changing conditions, including the introduction of new features and measurements driven by multi-vendor deployments, hardware upgrades, and evolving service requirements. To address this growing need for flexible learning in non-stationary environments, this vision paper highlights Adaptive Random Forests (ARFs) as a reliable solution for dynamic feature adaptation in communication network scenarios. We show that iterative training of ARFs can effectively lead to stable predictions, with accuracy improving over time as more features are added. In addition, we highlight the importance of explainability in AI-driven networks, proposing Drift-Aware Feature Importance (DAFI) as an efficient XAI feature importance (FI) method. DAFI uses a distributional drift detector to signal when to apply computationally intensive FI methods instead of lighter alternatives. Our tests on 3 different datasets indicate that our approach reduces runtime by up to 2 times, while producing more consistent feature importance values. Together, ARFs and DAFI provide a promising framework to build flexible AI methods adapted to 6G network use-cases.
comment: Accepted at AI4NextG Workshop, NeurIPS 2025
☆ Bridging the Physics-Data Gap with FNO-Guided Conditional Flow Matching: Designing Inductive Bias through Hierarchical Physical Constraints
Conventional time-series generation often ignores domain-specific physical constraints, limiting statistical and physical consistency. We propose a hierarchical framework that embeds the inherent hierarchy of physical laws-conservation, dynamics, boundary, and empirical relations-directly into deep generative models, introducing a new paradigm of physics-informed inductive bias. Our method combines Fourier Neural Operators (FNOs) for learning physical operators with Conditional Flow Matching (CFM) for probabilistic generation, integrated via time-dependent hierarchical constraints and FNO-guided corrections. Experiments on harmonic oscillators, human activity recognition, and lithium-ion battery degradation show 16.3% higher generation quality, 46% fewer physics violations, and 18.5% improved predictive accuracy over baselines.
comment: 8 pages, 1 figure
☆ Counterfactual Identifiability via Dynamic Optimal Transport NeurIPS 2025
We address the open question of counterfactual identification for high-dimensional multivariate outcomes from observational data. Pearl (2000) argues that counterfactuals must be identifiable (i.e., recoverable from the observed data distribution) to justify causal claims. A recent line of work on counterfactual inference shows promising results but lacks identification, undermining the causal validity of its estimates. To address this, we establish a foundation for multivariate counterfactual identification using continuous-time flows, including non-Markovian settings under standard criteria. We characterise the conditions under which flow matching yields a unique, monotone and rank-preserving counterfactual transport map with tools from dynamic optimal transport, ensuring consistent inference. Building on this, we validate the theory in controlled scenarios with counterfactual ground-truth and demonstrate improvements in axiomatic counterfactual soundness on real images.
comment: Accepted at NeurIPS 2025
☆ Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization
Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
☆ Opponent Shaping in LLM Agents
Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players' learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner's Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner's Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.
comment: 29 pages, 15 figures, 15 tables
☆ Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.
comment: 13 pages, 3 figures
☆ The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Large Language Models (LLMs) are increas- ingly integral to information dissemination and decision-making processes. Given their grow- ing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propa- gation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the in- herent political leanings of these models. Sub- sequently, persona prompting with the PCT is used to explore explicit stereotypes across vari- ous social dimensions. In a final step, implicit stereotypes are uncovered by evaluating mod- els with multilingual versions of the PCT. Key findings reveal a consistent left-leaning polit- ical alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those iden- tified via explicit persona prompting. Interest- ingly, for most models, implicit and explicit stereotypes show a notable alignment, suggest- ing a degree of transparency or "awareness" regarding their inherent biases. This study un- derscores the complex interplay of political bias and stereotypes in LLMs.
☆ Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to $42.9\%$ over previously SOTA baselines and $55.8\%$ over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.
☆ Reinforcement Learning from Probabilistic Forecasts for Safe Decision-Making via Conditional Value-at-Risk Planning
Sequential decisions in volatile, high-stakes settings require more than maximizing expected return; they require principled uncertainty management. This paper presents the Uncertainty-Aware Markov Decision Process (UAMDP), a unified framework that couples Bayesian forecasting, posterior-sampling reinforcement learning, and planning under a conditional value-at-risk (CVaR) constraint. In a closed loop, the agent updates its beliefs over latent dynamics, samples plausible futures via Thompson sampling, and optimizes policies subject to preset risk tolerances. We establish regret bounds that converge to the Bayes-optimal benchmark under standard regularity conditions. We evaluate UAMDP in two domains-high-frequency equity trading and retail inventory control-both marked by structural uncertainty and economic volatility. Relative to strong deep learning baselines, UAMDP improves long-horizon forecasting accuracy (RMSE decreases by up to 25\% and sMAPE by 32\%), and these gains translate into economic performance: the trading Sharpe ratio rises from 1.54 to 1.74 while maximum drawdown is roughly halved. These results show that integrating calibrated probabilistic modeling, exploration aligned with posterior uncertainty, and risk-aware control yields a robust, generalizable approach to safer and more profitable sequential decision-making.
☆ Investigating Counterclaims in Causality Extraction from Text
Research on causality extraction from text has so far almost entirely neglected counterclaims. Existing causality extraction datasets focus solely on "procausal" claims, i.e., statements that support a relationship. "Concausal" claims, i.e., statements that refute a relationship, are entirely ignored or even accidentally annotated as procausal. We address this shortcoming by developing a new dataset that integrates concausality. Based on an extensive literature review, we first show that concausality is an integral part of causal reasoning on incomplete knowledge. We operationalize this theory in the form of a rigorous guideline for annotation and then augment the Causal News Corpus with concausal statements, obtaining a substantial inter-annotator agreement of Cohen's $\kappa=0.74$. To demonstrate the importance of integrating concausal statements, we show that models trained without concausal relationships tend to misclassify these as procausal instead. Based on our new dataset, this mistake can be mitigated, enabling transformers to effectively distinguish pro- and concausality.
☆ Post-hoc Stochastic Concept Bottleneck Models
Concept Bottleneck Models (CBMs) are interpretable models that predict the target variable through high-level human-understandable concepts, allowing users to intervene on mispredicted concepts to adjust the final output. While recent work has shown that modeling dependencies between concepts can improve CBM performance, especially under interventions, such approaches typically require retraining the entire model, which may be infeasible when access to the original data or compute is limited. In this paper, we introduce Post-hoc Stochastic Concept Bottleneck Models (PSCBMs), a lightweight method that augments any pre-trained CBM with a multivariate normal distribution over concepts by adding only a small covariance-prediction module, without retraining the backbone model. We propose two training strategies and show on real-world data that PSCBMs consistently match or improve both concept and target accuracy over standard CBMs at test time. Furthermore, we show that due to the modeling of concept dependencies, PSCBMs perform much better than CBMs under interventions, while remaining far more efficient than retraining a similar stochastic model from scratch.
☆ Expressive Value Learning for Scalable Offline Reinforcement Learning
Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference-time, EVOR performs inference-time policy extraction via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
comment: 24 pages, 5 figures
☆ FuelCast: Benchmarking Tabular and Temporal Models for Ship Fuel Consumption ECML
In the shipping industry, fuel consumption and emissions are critical factors due to their significant impact on economic efficiency and environmental sustainability. Accurate prediction of ship fuel consumption is essential for further optimization of maritime operations. However, heterogeneous methodologies and limited high-quality datasets hinder direct comparison of modeling approaches. This paper makes three key contributions: (1) we introduce and release a new dataset (https://huggingface.co/datasets/krohnedigital/FuelCast) comprising operational and environmental data from three ships; (2) we define a standardized benchmark covering tabular regression and time-series regression (3) we investigate the application of in-context learning for ship consumption modeling using the TabPFN foundation model - a first in this domain to our knowledge. Our results demonstrate strong performance across all evaluated models, supporting the feasibility of onboard, data-driven fuel prediction. Models incorporating environmental conditions consistently outperform simple polynomial baselines relying solely on vessel speed. TabPFN slightly outperforms other techniques, highlighting the potential of foundation models with in-context learning capabilities for tabular prediction. Furthermore, including temporal context improves accuracy.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published in "ECML PKDD Workshop 2025 - Advanced Analytics and Learning on Temporal Data"
☆ Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data
Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise, hindering model performance. While methods exist for each issue, effectively combining them is non-trivial, as distinguishing genuine tail samples from noisy data proves difficult, often leading to conflicting optimization strategies. This paper presents a novel perspective: instead of primarily developing new complex techniques from scratch, we explore synergistically leveraging well-established, individually 'weak' auxiliary models - specialized for tackling either class imbalance or label noise but not both. This view is motivated by the insight that class imbalance (a distributional-level concern) and label noise (a sample-level concern) operate at different granularities, suggesting that robustness mechanisms for each can in principle offer complementary strengths without conflict. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights from such 'weak', single-purpose auxiliary models. Specifically, D-SINK uses an optimal transport-optimized surrogate label allocation to align the target model's sample-level predictions with a noise-robust auxiliary and its class distributions with an imbalance-robust one. Extensive experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
comment: 25 pages, 2 figures
☆ Long-tailed Recognition with Model Rebalancing
Long-tailed recognition is ubiquitous and challenging in deep learning and even in the downstream finetuning of foundation models, since the skew class distribution generally prevents the model generalization to the tail classes. Despite the promise of previous methods from the perspectives of data augmentation, loss rebalancing and decoupled training etc., consistent improvement in the broad scenarios like multi-label long-tailed recognition is difficult. In this study, we dive into the essential model capacity impact under long-tailed context, and propose a novel framework, Model Rebalancing (MORE), which mitigates imbalance by directly rebalancing the model's parameter space. Specifically, MORE introduces a low-rank parameter component to mediate the parameter space allocation guided by a tailored loss and sinusoidal reweighting schedule, but without increasing the overall model complexity or inference costs. Extensive experiments on diverse long-tailed benchmarks, spanning multi-class and multi-label tasks, demonstrate that MORE significantly improves generalization, particularly for tail classes, and effectively complements existing imbalance mitigation methods. These results highlight MORE's potential as a robust plug-and-play module in long-tailed settings.
☆ Leveraging Whisper Embeddings for Audio-based Lyrics Matching
Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.
☆ Bidirectional Representations Augmented Autoregressive Biological Sequence Generation:Application in De Novo Peptide Sequencing NeurIPS 2025
Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM-Labs/denovo.
comment: Accepted by NeurIPS 2025
☆ Beyond Sub-6 GHz: Leveraging mmWave Wi-Fi for Gait-Based Person Identification
Person identification plays a vital role in enabling intelligent, personalized, and secure human-computer interaction. Recent research has demonstrated the feasibility of leveraging Wi-Fi signals for passive person identification using a person's unique gait pattern. Although most existing work focuses on sub-6 GHz frequencies, the emergence of mmWave offers new opportunities through its finer spatial resolution, though its comparative advantages for person identification remain unexplored. This work presents the first comparative study between sub-6 GHz and mmWave Wi-Fi signals for person identification with commercial off-the-shelf (COTS) Wi-Fi, using a novel dataset of synchronized measurements from the two frequency bands in an indoor environment. To ensure a fair comparison, we apply identical training pipelines and model configurations across both frequency bands. Leveraging end-to-end deep learning, we show that even at low sampling rates (10 Hz), mmWave Wi-Fi signals can achieve high identification accuracy (91.2% on 20 individuals) when combined with effective background subtraction.
☆ Quantum Agents for Algorithmic Discovery
We introduce quantum agents trained by episodic, reward-based reinforcement learning to autonomously rediscover several seminal quantum algorithms and protocols. In particular, our agents learn: efficient logarithmic-depth quantum circuits for the Quantum Fourier Transform; Grover's search algorithm; optimal cheating strategies for strong coin flipping; and optimal winning strategies for the CHSH and other nonlocal games. The agents achieve these results directly through interaction, without prior access to known optimal solutions. This demonstrates the potential of quantum intelligence as a tool for algorithmic discovery, opening the way for the automated design of novel quantum algorithms and protocols.
☆ Unsupervised Multi-Source Federated Domain Adaptation under Domain Diversity through Group-Wise Discrepancy Minimization
Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
☆ AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents
The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.
comment: Accepted to the EMNLP 2025 Industry Track
☆ Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
☆ Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Finetuning
Reinforcement finetuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
☆ Approximate Domain Unlearning for Vision-Language Models NeurIPS 2025
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on class unlearning, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize real cars while avoiding misrecognition of illustrated cars depicted in roadside advertisements as real cars, which could be hazardous. In this paper, we introduce Approximate Domain Unlearning (ADU), a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., illustration) while preserving accuracy for other domains (e.g., real). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments show that our approach outperforms baselines built upon VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs. Code: https://kodaikawamura.github.io/Domain_Unlearning/.
comment: NeurIPS 2025 (Spotlight)
☆ High-dimensional Analysis of Synthetic Data Selection
Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets and generative models used for augmentation.
☆ Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation
Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians' workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.
comment: 10 pages, 9 figures. This work has been submitted to the IEEE for possible publication
☆ Bayesian Decision Making around Experts
Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how should learners optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes their one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner for the cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.
☆ Lossless Vocabulary Reduction for Auto-Regressive Language Models
Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has their own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions such as model ensemble. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into the one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
☆ Beyond Real Data: Synthetic Data through the Lens of Regularization
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.
☆ Computations and ML for surjective rational maps
The present note studies \emph{surjective rational endomorphisms} $f: \mathbb{P}^2 \dashrightarrow \mathbb{P}^2$ with \emph{cubic} terms and the indeterminacy locus $I_f \ne \emptyset$. We develop an experimental approach, based on some Python programming and Machine Learning, towards the classification of such maps; a couple of new explicit $f$ is constructed in this way. We also prove (via pure projective geometry) that a general non-regular cubic endomorphism $f$ of $\mathbb{P}^2$ is surjective if and only if the set $I_f$ has cardinality at least $3$.
comment: 15 pages, 2 figures, a couple of Python codes
☆ A Novel Ensemble Learning Approach for Enhanced IoT Attack Detection: Redefining Security Paradigms in Connected Systems
The rapid expansion of Internet of Things (IoT) devices has transformed industries and daily life by enabling widespread connectivity and data exchange. However, this increased interconnection has introduced serious security vulnerabilities, making IoT systems more exposed to sophisticated cyber attacks. This study presents a novel ensemble learning architecture designed to improve IoT attack detection. The proposed approach applies advanced machine learning techniques, specifically the Extra Trees Classifier, along with thorough preprocessing and hyperparameter optimization. It is evaluated on several benchmark datasets including CICIoT2023, IoTID20, BotNeTIoT L01, ToN IoT, N BaIoT, and BoT IoT. The results show excellent performance, achieving high recall, accuracy, and precision with very low error rates. These outcomes demonstrate the model efficiency and superiority compared to existing approaches, providing an effective and scalable method for securing IoT environments. This research establishes a solid foundation for future progress in protecting connected devices from evolving cyber threats.
comment: 14 pages, 5 fiugres, 7 tables
☆ Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50\% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
☆ Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection NeurIPS 2025
AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
comment: Accepted at NeurIPS 2025 spotlight
☆ Mitigating Subject Dependency in EEG Decoding with Subject-Specific Low-Rank Adapters
Subject-specific distribution shifts represent an important obstacle to the development of foundation models for EEG decoding. To address this, we propose Subject-Conditioned Layer,, an adaptive layer designed as a drop-in replacement for standard linear or convolutional layers in any neural network architecture. Our layer captures subject-specific variability by decomposing its weights into a shared, subject-invariant component and a lightweight, low-rank correction unique to each subject. This explicit separation of general knowledge from personalized adaptation allows existing models to become robust to subject shifts. Empirically, models equipped with our layer outperform both a shared-weight-only model (subject-agnostic model) and the average of individually trained subject-specific models. Consequently, the Subject-Conditioned Layer, offers a practical and scalable path towards building effective cross-subject foundation models for EEG.
☆ From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to 39% and inflate energy consumption. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to 70%, End-to-End latency by 41% and per-token energy by up to 22%. Evaluations show that layered prefill consistently improves the TTFT--TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.
comment: 13 pages, 5 figure, 8 tables
☆ Verifying Graph Neural Networks with Readout is Intractable
We introduce a logical language for reasoning about quantized aggregate-combine graph neural networks with global readout (ACR-GNNs). We provide a logical characterization and use it to prove that verification tasks for quantized GNNs with readout are (co)NEXPTIME-complete. This result implies that the verification of quantized GNNs is computationally intractable, prompting substantial research efforts toward ensuring the safety of GNN-based systems. We also experimentally demonstrate that quantized ACR-GNN models are lightweight while maintaining good accuracy and generalization capabilities with respect to non-quantized models.
☆ Climate Knowledge in Large Language Models
Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries at 1{\deg} resolution land points, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 {\deg}C and biases of $\pm$1 {\deg}C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 {\deg}C compared to 2-4 {\deg}C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.
comment: 16 pages, 4 figures, 2 tables
☆ Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity
Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al., have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a 32 $\times$ width multiplier for ResNet-20). This is broadly believed to be because increasing the model width ensures a large enough space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when using a suitable softmax temperature calibration. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently the merged model's output matches that of an ensemble of the original models, which facilitates LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.
☆ Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses
Model merging (MM) recently emerged as an effective method for combining large deep learning models. However, it poses significant security risks. Recent research shows that it is highly susceptible to backdoor attacks, which introduce a hidden trigger into a single fine-tuned model instance that allows the adversary to control the output of the final merged model at inference time. In this work, we propose a simple framework for understanding backdoor attacks by treating the attack itself as a task vector. $Backdoor\ Vector\ (BV)$ is calculated as the difference between the weights of a fine-tuned backdoored model and fine-tuned clean model. BVs reveal new insights into attacks understanding and a more effective framework to measure their similarity and transferability. Furthermore, we propose a novel method that enhances backdoor resilience through merging dubbed $Sparse\ Backdoor\ Vector\ (SBV)$ that combines multiple attacks into a single one. We identify the core vulnerability behind backdoor threats in MM: $inherent\ triggers$ that exploit adversarial weaknesses in the base model. To counter this, we propose $Injection\ BV\ Subtraction\ (IBVS)$ - an assumption-free defense against backdoors in MM. Our results show that SBVs surpass prior attacks and is the first method to leverage merging to improve backdoor effectiveness. At the same time, IBVS provides a lightweight, general defense that remains effective even when the backdoor threat is entirely unknown.
comment: 22 pages, 13 figures, 15 tables
☆ Unsupervised Radio Map Construction in Mixed LoS/NLoS Indoor Environments
Radio maps are essential for enhancing wireless communications and localization. However, existing methods for constructing radio maps typically require costly calibration processes to collect location-labeled channel state information (CSI) datasets. This paper aims to recover the data collection trajectory directly from the channel propagation sequence, eliminating the need for location calibration. The key idea is to employ a hidden Markov model (HMM)-based framework to conditionally model the channel propagation matrix, while simultaneously modeling the location correlation in the trajectory. The primary challenges involve modeling the complex relationship between channel propagation in multiple-input multiple-output (MIMO) networks and geographical locations, and addressing both line-of-sight (LOS) and non-line-of-sight (NLOS) indoor conditions. In this paper, we propose an HMM-based framework that jointly characterizes the conditional propagation model and the evolution of the user trajectory. Specifically, the channel propagation in MIMO networks is modeled separately in terms of power, delay, and angle, with distinct models for LOS and NLOS conditions. The user trajectory is modeled using a Gaussian-Markov model. The parameters for channel propagation, the mobility model, and LOS/NLOS classification are optimized simultaneously. Experimental validation using simulated MIMO-Orthogonal Frequency-Division Multiplexing (OFDM) networks with a multi-antenna uniform linear arrays (ULA) configuration demonstrates that the proposed method achieves an average localization accuracy of 0.65 meters in an indoor environment, covering both LOS and NLOS regions. Moreover, the constructed radio map enables localization with a reduced error compared to conventional supervised methods, such as k-nearest neighbors (KNN), support vector machine (SVM), and deep neural network (DNN).
☆ Accelerated Evolving Set Processes for Local PageRank Computation
This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2), \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector, where $m$ denotes the number of edges in the graph and $R$ is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only $\tilde{\mathcal{O}}(1/\sqrt{\alpha})$ such linear systems, where $\alpha$ is the damping factor. When $1/\epsilon^2\ll m$, this implies the existence of an algorithm that computes an $\ epsilon $-approximation of the PPR vector with an overall time complexity of $\tilde{\mathcal{O}}\left(R^2 / (\sqrt{\alpha}\epsilon^2)\right)$, independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
☆ Language Models Do Not Embed Numbers Continuously
Recent research has extensively studied how large language models manipulate integers in specific arithmetic tasks, and on a more fundamental level, how they represent numeric values. These previous works have found that language model embeddings can be used to reconstruct the original values, however, they do not evaluate whether language models actually model continuous values as continuous. Using expected properties of the embedding space, including linear reconstruction and principal component analysis, we show that language models not only represent numeric spaces as non-continuous but also introduce significant noise. Using models from three major providers (OpenAI, Google Gemini and Voyage AI), we show that while reconstruction is possible with high fidelity ($R^2 \geq 0.95$), principal components only explain a minor share of variation within the embedding space. This indicates that many components within the embedding space are orthogonal to the simple numeric input space. Further, both linear reconstruction and explained variance suffer with increasing decimal precision, despite the ordinal nature of the input space being fundamentally unchanged. The findings of this work therefore have implications for the many areas where embedding models are used, in-particular where high numerical precision, large magnitudes or mixed-sign values are common.
comment: 12 pages, 10 figures, 3 tables
☆ Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training
The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Numerous computational costs have been invested in existing well-trained checkpoints, but many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose orthogonal growth method well-suited for converged Mixture-of-Experts model: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across checkpoints sequences, we perform comprehensive scaling experiments revealing that the final accuracy has a strong positive correlation with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.
☆ DemandCast: Global hourly electricity demand forecasting NeurIPS 2025
This paper presents a machine learning framework for electricity demand forecasting across diverse geographical regions using the gradient boosting algorithm XGBoost. The model integrates historical electricity demand and comprehensive weather and socioeconomic variables to predict normalized electricity demand profiles. To enable robust training and evaluation, we developed a large-scale dataset spanning multiple years and countries, applying a temporal data-splitting strategy that ensures benchmarking of out-of-sample performance. Our approach delivers accurate and scalable demand forecasts, providing valuable insights for energy system planners and policymakers as they navigate the challenges of the global energy transition.
comment: 7 pages, 4 figures, accepted at the NeurIPS 2025 Workshop: Tackling Climate Change with Machine Learning
☆ Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning in vLLM are applied (Magnitude, Wanda, and SparseGPT), it consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
☆ Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training NeurIPS 2025
Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing experiment performance gaps. However, the theoretical reasons for its effectiveness and whether this gap can be fully eliminated by MGS remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. 1). Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. 2). Gap to Centralization: Even as MGS approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{n m})$ in centralized and $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ in decentralized). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, promising experiments on CIFAR datasets support our theoretical findings.
comment: This paper has been accepted by NeurIPS 2025 (Spotlight)
☆ VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks primarily focus on isolated capabilities such as transcription, or question-answering, and do not systematically evaluate agentic scenarios encompassing multilingual and cultural understanding, as well as adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries, including dialogues grounded in Indian context, covering single-tool invocations, multi-tool workflows, multi-turn interactions, and safety evaluations. The benchmark supports English, Hindi, and 5 other Indian languages, reflecting real-world linguistic and cultural diversity. We simulate speaker variability using a novel sampling algorithm that selects audios for TTS voice conversion based on its speaker embeddings, maximizing acoustic and speaker diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Our experiments reveal significant gaps in contextual tool orchestration tasks, Indic generalization, and adversarial robustness, exposing critical limitations of current SpeechLMs.
☆ Climate Surrogates for Scalable Multi-Agent Reinforcement Learning: A Case Study with CICERO-SCM
Climate policy studies require models that capture the combined effects of multiple greenhouse gases on global temperature, but these models are computationally expensive and difficult to embed in reinforcement learning. We present a multi-agent reinforcement learning (MARL) framework that integrates a high-fidelity, highly efficient climate surrogate directly in the environment loop, enabling regional agents to learn climate policies under multi-gas dynamics. As a proof of concept, we introduce a recurrent neural network architecture pretrained on ($20{,}000$) multi-gas emission pathways to surrogate the climate model CICERO-SCM. The surrogate model attains near-simulator accuracy with global-mean temperature RMSE $\approx 0.0004 \mathrm{K}$ and approximately $1000\times$ faster one-step inference. When substituted for the original simulator in a climate-policy MARL setting, it accelerates end-to-end training by $>\!100\times$. We show that the surrogate and simulator converge to the same optimal policies and propose a methodology to assess this property in cases where using the simulator is intractable. Our work allows to bypass the core computational bottleneck without sacrificing policy fidelity, enabling large-scale multi-agent experiments across alternative climate-policy regimes with multi-gas dynamics and high-fidelity climate response.
☆ Stick-Breaking Mixture Normalizing Flows with Component-Wise Tail Adaptation for Variational Inference
Normalizing flows with a Gaussian base provide a computationally efficient way to approximate posterior distributions in Bayesian inference, but they often struggle to capture complex posteriors with multimodality and heavy tails. We propose a stick-breaking mixture base with component-wise tail adaptation (StiCTAF) for posterior approximation. The method first learns a flexible mixture base to mitigate the mode-seeking bias of reverse KL divergence through a weighted average of component-wise ELBOs. It then estimates local tail indices of unnormalized densities and finally refines each mixture component using a shared backbone combined with component-specific tail transforms calibrated by the estimated indices. This design enables accurate mode coverage and anisotropic tail modeling while retaining exact density evaluation and stable optimization. Experiments on synthetic posteriors demonstrate improved tail recovery and better coverage of multiple modes compared to benchmark models. We also present a real-data analysis illustrating the practical benefits of our approach for posterior inference.
☆ PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation NeurIPS 2025
In single-cell perturbation prediction, a central task is to forecast the effects of perturbing a gene unseen in the training data. The efficacy of such predictions depends on two factors: (1) the similarity of the target gene to those covered in the training data, which informs model (epistemic) uncertainty, and (2) the quality of the corresponding training data, which reflects data (aleatoric) uncertainty. Both factors are critical for determining the reliability of a prediction, particularly as gene perturbation is an inherently stochastic biochemical process. In this paper, we propose PRESCRIBE (PREdicting Single-Cell Response wIth Bayesian Estimation), a multivariate deep evidential regression framework designed to measure both sources of uncertainty jointly. Our analysis demonstrates that PRESCRIBE effectively estimates a confidence score for each prediction, which strongly correlates with its empirical accuracy. This capability enables the filtering of untrustworthy results, and in our experiments, it achieves steady accuracy improvements of over 3% compared to comparable baselines.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG
Wearable EEG devices have emerged as a promising alternative to polysomnography (PSG). As affordable and scalable solutions, their widespread adoption results in the collection of massive volumes of unlabeled data that cannot be analyzed by clinicians at scale. Meanwhile, the recent success of deep learning for sleep scoring has relied on large annotated datasets. Self-supervised learning (SSL) offers an opportunity to bridge this gap, leveraging unlabeled signals to address label scarcity and reduce annotation effort. In this paper, we present the first systematic evaluation of SSL for sleep staging using wearable EEG. We investigate a range of well-established SSL methods and evaluate them on two sleep databases acquired with the Ikon Sleep wearable EEG headband: BOAS, a high-quality benchmark containing PSG and wearable EEG recordings with consensus labels, and HOGAR, a large collection of home-based, self-recorded, and unlabeled recordings. Three evaluation scenarios are defined to study label efficiency, representation quality, and cross-dataset generalization. Results show that SSL consistently improves classification performance by up to 10% over supervised baselines, with gains particularly evident when labeled data is scarce. SSL achieves clinical-grade accuracy above 80% leveraging only 5% to 10% of labeled data, while the supervised approach requires twice the labels. Additionally, SSL representations prove robust to variations in population characteristics, recording environments, and signal quality. Our findings demonstrate the potential of SSL to enable label-efficient sleep staging with wearable EEG, reducing reliance on manual annotations and advancing the development of affordable sleep monitoring systems.
comment: 12 pages, 4 figures
☆ DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that $\textit{maximise diversity in model responses}$. Our method, $\textbf{Diversifying Sample Condensation (DISCO)}$, selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. $\textbf{DISCO}$ shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
☆ SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation ICME 2025
Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complementary to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths from probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.
comment: accepted by ICME 2025
☆ TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
comment: Project page: https://ttom-t2v.github.io/
☆ Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks
This paper presents four theoretical contributions that improve the usability of risk certificates for neural networks based on PAC-Bayes bounds. First, two bounds on the KL divergence between Bernoulli distributions enable the derivation of the tightest explicit bounds on the true risk of classifiers across different ranges of empirical risk. The paper next focuses on the formalization of an efficient methodology based on implicit differentiation that enables the introduction of the optimization of PAC-Bayesian risk certificates inside the loss/objective function used to fit the network/model. The last contribution is a method to optimize bounds on non-differentiable objectives such as the 0-1 loss. These theoretical contributions are complemented with an empirical evaluation on the MNIST and CIFAR-10 datasets. In fact, this paper presents the first non-vacuous generalization bounds on CIFAR-10 for neural networks.
☆ Synergy Between the Strong and the Weak: Spiking Neural Networks are Inherently Self-Distillers NeurIPS 2025
Brain-inspired spiking neural networks (SNNs) promise to be a low-power alternative to computationally intensive artificial neural networks (ANNs), although performance gaps persist. Recent studies have improved the performance of SNNs through knowledge distillation, but rely on large teacher models or introduce additional training overhead. In this paper, we show that SNNs can be naturally deconstructed into multiple submodels for efficient self-distillation. We treat each timestep instance of the SNN as a submodel and evaluate its output confidence, thus efficiently identifying the strong and the weak. Based on this strong and weak relationship, we propose two efficient self-distillation schemes: (1) \textbf{Strong2Weak}: During training, the stronger "teacher" guides the weaker "student", effectively improving overall performance. (2) \textbf{Weak2Strong}: The weak serve as the "teacher", distilling the strong in reverse with underlying dark knowledge, again yielding significant performance gains. For both distillation schemes, we offer flexible implementations such as ensemble, simultaneous, and cascade distillation. Experiments show that our method effectively improves the discriminability and overall performance of the SNN, while its adversarial robustness is also enhanced, benefiting from the stability brought by self-distillation. This ingeniously exploits the temporal properties of SNNs and provides insight into how to efficiently train high-performance SNNs.
comment: Accepted by NeurIPS 2025
☆ SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening
Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a $(1+O(\epsilon))$ factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.
comment: 23 pages, 5 figures, Code Available: https://doi.org/10.5281/zenodo.17223405
☆ GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploratio
Overall architecture of the personalized multi-objective ranking system. It comprises: (1) a Feature Center and Prerank Model for initial feature processing and candidate generation; (2) a Multi-Task Learning (MTL) model predicting various user feedback signals; (3) a Multi-Task Fusion (MTF) module (our proposed GRADE framework) that learns personalized weights ($w_1, \dots, w_n$); these weights are then applied to calculate final scores and sorted to generate a blended ranking by the Blended Ranking Model, which ultimately delivers results to users.
☆ MMM: Quantum-Chemical Molecular Representation Learning for Combinatorial Drug Recommendation MICCAI
Drug recommendation is an essential task in machine learning-based clinical decision support systems. However, the risk of drug-drug interactions (DDI) between co-prescribed medications remains a significant challenge. Previous studies have used graph neural networks (GNNs) to represent drug structures. Regardless, their simplified discrete forms cannot fully capture the molecular binding affinity and reactivity. Therefore, we propose Multimodal DDI Prediction with Molecular Electron Localization Function (ELF) Maps (MMM), a novel framework that integrates three-dimensional (3D) quantum-chemical information into drug representation learning. It generates 3D electron density maps using the ELF. To capture both therapeutic relevance and interaction risks, MMM combines ELF-derived features that encode global electronic properties with a bipartite graph encoder that models local substructure interactions. This design enables learning complementary characteristics of drug molecules. We evaluate MMM in the MIMIC-III dataset (250 drugs, 442 substructures), comparing it with several baseline models. In particular, a comparison with the GNN-based SafeDrug model demonstrates statistically significant improvements in the F1-score (p = 0.0387), Jaccard (p = 0.0112), and the DDI rate (p = 0.0386). These results demonstrate the potential of ELF-based 3D representations to enhance prediction accuracy and support safer combinatorial drug prescribing in clinical practice.
comment: Medical Image Computing and Computer-Assisted Intervention (MICCAI) Predictive Intelligence in Medicine Workshop (MICCAI PRIME) 2025; 13 pages
☆ Multi-level informed optimization via decomposed Kriging for large design problems under uncertainty
Engineering design involves demanding models encompassing many decision variables and uncontrollable parameters. In addition, unavoidable aleatoric and epistemic uncertainties can be very impactful and add further complexity. The state-of-the-art adopts two steps, uncertainty quantification and design optimization, to optimize systems under uncertainty by means of robust or stochastic metrics. However, conventional scenario-based, surrogate-assisted, and mathematical programming methods are not sufficiently scalable to be affordable and precise in large and complex cases. Here, a multi-level approach is proposed to accurately optimize resource-intensive, high-dimensional, and complex engineering problems under uncertainty with minimal resources. A non-intrusive, fast-scaling, Kriging-based surrogate is developed to map the combined design/parameter domain efficiently. Multiple surrogates are adaptively updated by hierarchical and orthogonal decomposition to leverage the fewer and most uncertainty-informed data. The proposed method is statistically compared to the state-of-the-art via an analytical testbed and is shown to be concurrently faster and more accurate by orders of magnitude.
comment: 34 pages, 18 figures
☆ Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression Filtering Method for SEM Images
Scanning Electron Microscopy (SEM) images often suffer from noise contamination, which degrades image quality and affects further analysis. This research presents a complete approach to estimate their Signal-to-Noise Ratio (SNR) and noise variance (NV), and enhance image quality using NV-guided Wiener filter. The main idea of this study is to use a good SNR estimation technique and infuse a machine learning model to estimate NV of the SEM image, which then guides the wiener filter to remove the noise, providing a more robust and accurate SEM image filtering pipeline. First, we investigate five different SNR estimation techniques, namely Nearest Neighbourhood (NN) method, First-Order Linear Interpolation (FOL) method, Nearest Neighbourhood with First-Order Linear Interpolation (NN+FOL) method, Non-Linear Least Squares Regression (NLLSR) method, and Linear Least Squares Regression (LSR) method. It is shown that LSR method to perform better than the rest. Then, Support Vector Machines (SVM) and Gaussian Process Regression (GPR) are tested by pairing it with LSR. In this test, the Optimizable GPR model shows the highest accuracy and it stands as the most effective solution for NV estimation. Combining these results lead to the proposed Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression (AO-GPRLLSR) Filtering pipeline. The AO-GPRLLSR method generated an estimated noise variance which served as input to NV-guided Wiener filter for improving the quality of SEM images. The proposed method is shown to achieve notable success in estimating SNR and NV of SEM images and leads to lower Mean Squared Error (MSE) after the filtering process.
comment: "Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression Filtering Method for SEM Images," in IEEE Access, vol. 13, pp. 93574-93592, 2025, doi: 10.1109/ACCESS.2025.3573389
☆ Signal-to-Noise Ratio in Scanning Electron Microscopy: A Comprehensive Review
Scanning Electron Microscopy (SEM) is critical in nanotechnology, materials science, and biological imaging due to its high spatial resolution and depth of focus. Signal-to-noise ratio (SNR) is an essential parameter in SEM because it directly impacts the quality and interpretability of the images. SEM is widely used in various scientific disciplines, but its utility can be compromised by noise, which degrades image clarity. This review explores multiple aspects of the SEM imaging process, from the principal operation of SEM, sources of noise in SEM, methods for SNR measurement and estimations, to various aspects that affect the SNR measurement and approaches to enhance SNR, both from a hardware and software standpoint. We review traditional and emerging techniques, focusing on their applications, advantages, and limitations. The paper aims to provide a comprehensive understanding of SNR optimization in SEM for researchers and practitioners and to encourage further research in the field.
comment: in IEEE Access, vol. 13, pp. 154395-154421, 2025, doi: 10.1109/ACCESS.2025.3603013
☆ Team Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception - Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track
In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent's ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.
☆ On the Optimality of the Median-of-Means Estimator under Adversarial Contamination
The Median-of-Means (MoM) is a robust estimator widely used in machine learning that is known to be (minimax) optimal in scenarios where samples are i.i.d. In more grave scenarios, samples are contaminated by an adversary that can inspect and modify the data. Previous work has theoretically shown the suitability of the MoM estimator in certain contaminated settings. However, the (minimax) optimality of MoM and its limitations under adversarial contamination remain unknown beyond the Gaussian case. In this paper, we present upper and lower bounds for the error of MoM under adversarial contamination for multiple classes of distributions. In particular, we show that MoM is (minimax) optimal in the class of distributions with finite variance, as well as in the class of distributions with infinite variance and finite absolute $(1+r)$-th moment. We also provide lower bounds for MoM's error that match the order of the presented upper bounds, and show that MoM is sub-optimal for light-tailed distributions.
☆ On the Optimality of Tracking Fisher Information in Adaptive Testing with Stochastic Binary Responses
We study the problem of estimating a continuous ability parameter from sequential binary responses by actively asking questions with varying difficulties, a setting that arises naturally in adaptive testing and online preference learning. Our goal is to certify that the estimate lies within a desired margin of error, using as few queries as possible. We propose a simple algorithm that adaptively selects questions to maximize Fisher information and updates the estimate using a method-of-moments approach, paired with a novel test statistic to decide when the estimate is accurate enough. We prove that this Fisher-tracking strategy achieves optimal performance in both fixed-confidence and fixed-budget regimes, which are commonly invested in the best-arm identification literature. Our analysis overcomes a key technical challenge in the fixed-budget setting -- handling the dependence between the evolving estimate and the query distribution -- by exploiting a structural symmetry in the model and combining large deviation tools with Ville's inequality. Our results provide rigorous theoretical support for simple and efficient adaptive testing procedures.
☆ Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models
Large language models (LLM) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations-such as marginalized role in model architectures, reliance on coarse statistical text prompts, and lack of interpretability. In this work, we introduce Augur, a fully LLM driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two stage teacher student architecture where a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine tune on high confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 25 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.
comment: 22 pages, 9 figures
☆ Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials
High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.
☆ Meta-Learning Based Few-Shot Graph-Level Anomaly Detection
Graph-level anomaly detection aims to identify anomalous graphs or subgraphs within graph datasets, playing a vital role in various fields such as fraud detection, review classification, and biochemistry. While Graph Neural Networks (GNNs) have made significant progress in this domain, existing methods rely heavily on large amounts of labeled data, which is often unavailable in real-world scenarios. Additionally, few-shot anomaly detection methods based on GNNs are prone to noise interference, resulting in poor embedding quality and reduced model robustness. To address these challenges, we propose a novel meta-learning-based graph-level anomaly detection framework (MA-GAD), incorporating a graph compression module that reduces the graph size, mitigating noise interference while retaining essential node information. We also leverage meta-learning to extract meta-anomaly information from similar networks, enabling the learning of an initialization model that can rapidly adapt to new tasks with limited samples. This improves the anomaly detection performance on target graphs, and a bias network is used to enhance the distinction between anomalous and normal nodes. Our experimental results, based on four real-world biochemical datasets, demonstrate that MA-GAD outperforms existing state-of-the-art methods in graph-level anomaly detection under few-shot conditions. Experiments on both graph anomaly and subgraph anomaly detection tasks validate the framework's effectiveness on real-world datasets.
comment: Accepted by ARRML2025
☆ Self-Improving LLM Agents at Test-Time
One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps: (i) first it identifies the samples that model struggles with (self-awareness), (ii) then generates similar examples from detected uncertain samples (self-data augmentation), and (iii) uses these newly generated samples at test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and contrast this approach with Test-Time Distillation (TT-D), where a stronger model generates similar examples for uncertain cases, enabling student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves the performance with +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods, yet using 68x less training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test-time as a new paradigm for building more capable agents toward self-evolution.
☆ MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation NeurIPS 2025
This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at https://github.com/ws-jiang/MetaDefense.
comment: Accepted By NeurIPS 2025
☆ Surrogate Graph Partitioning for Spatial Prediction
Spatial prediction refers to the estimation of unobserved values from spatially distributed observations. Although recent advances have improved the capacity to model diverse observation types, adoption in practice remains limited in industries that demand interpretability. To mitigate this gap, surrogate models that explain black-box predictors provide a promising path toward interpretable decision making. In this study, we propose a graph partitioning problem to construct spatial segments that minimize the sum of within-segment variances of individual predictions. The assignment of data points to segments can be formulated as a mixed-integer quadratic programming problem. While this formulation potentially enables the identification of exact segments, its computational complexity becomes prohibitive as the number of data points increases. Motivated by this challenge, we develop an approximation scheme that leverages the structural properties of graph partitioning. Experimental results demonstrate the computational efficiency of this approximation in identifying spatial segments.
comment: 18 pages, 5 figures, 2 tables
☆ SIMU: Selective Influence Machine Unlearning NeurIPS 2025
The undesired memorization of sensitive information by Large Language Models (LLMs) has emphasized the need for safety mechanisms that can regulate model behavior. This has led to the development of machine unlearning techniques that enable models to precisely forget sensitive and unwanted information. For machine unlearning, first-order and second-order optimizer-based methods have shown significant progress in enabling LLMs to forget targeted information. However, in doing so, these approaches often compromise the model's original capabilities, resulting in unlearned models that struggle to retain their prior knowledge and overall utility. To address this, we propose Selective Influence Machine Unlearning (SIMU), a two-step framework that enhances second-order optimizer-based unlearning by selectively updating only the critical neurons responsible for encoding the forget-set. By constraining updates to these targeted neurons, SIMU achieves comparable unlearning efficacy while substantially outperforming current methods in retaining the model's original knowledge.
comment: Accepted to NeurIPS 2025 Workshop: Constrained Optimization for Machine Learning (COML)
☆ Adaptive Execution Scheduler for DataDios SmartDiff
We present an adaptive scheduler for a single differencing engine (SmartDiff) with two execution modes: (i) in-memory threads and (ii) Dask based parallelism. The scheduler continuously tunes batch size and worker/thread count within fixed CPU and memory budgets to minimize p95 latency. A lightweight preflight profiler estimates bytes/row and I/O rate; an online cost/memory model prunes unsafe actions; and a guarded hill-climb policy favors lower latency with backpressure and straggler mitigation. Backend selection is gated by a conservative working-set estimate so that in-memory execution is chosen when safe, otherwise Dask is used. Across synthetic and public tabular benchmarks, the scheduler reduces p95 latency by 23 to 28 percent versus a tuned warm-up heuristic (and by 35 to 40 percent versus fixed grid baselines), while lowering peak memory by 16 to 22 percent (25 to 32 percent vs. fixed) with zero OOMs and comparable throughput.
comment: 4 pages, 1 figure
☆ HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs
The extraction and standardization of pharmacokinetic (PK) information from scientific literature remain significant challenges in computational pharmacology, which limits the reliability of data-driven models in drug development. Large language models (LLMs) have achieved remarkable progress in text understanding and reasoning, yet their adaptation to structured biomedical data, such as PK tables, remains constrained by heterogeneity, noise, and domain shift. To address these limitations, we propose HySim-LLM, a unified mathematical and computational framework that integrates embedding-weighted fine-tuning and manifold-aware denoising to enhance the robustness and interpretability of LLMs. We establish two theoretical results: (1) a similarity-weighted generalization bound that quantifies adaptation performance under embedding divergence, and (2) a manifold-based denoising guarantee that bounds loss contributions from noisy or off-manifold samples. These theorems provide a principled foundation for fine-tuning LLMs in structured biomedical settings. The framework offers a mathematically grounded pathway toward reliable and interpretable LLM adaptation for biomedical and data-intensive scientific domains.
☆ HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.
comment: Under review
☆ Weak Form Learning for Mean-Field Partial Differential Equations: an Application to Insect Movement
Insect species subject to infection, predation, and anisotropic environmental conditions may exhibit preferential movement patterns. Given the innate stochasticity of exogenous factors driving these patterns over short timescales, individual insect trajectories typically obey overdamped stochastic dynamics. In practice, data-driven modeling approaches designed to learn the underlying Fokker-Planck equations from observed insect distributions serve as ideal tools for understanding and predicting such behavior. Understanding dispersal dynamics of crop and silvicultural pests can lead to a better forecasting of outbreak intensity and location, which can result in better pest management. In this work, we extend weak-form equation learning techniques, coupled with kernel density estimation, to learn effective models for lepidopteran larval population movement from highly sparse experimental data. Galerkin methods such as the Weak form Sparse Identification of Nonlinear Dynamics (WSINDy) algorithm have recently proven useful for learning governing equations in several scientific contexts. We demonstrate the utility of the method on a sparse dataset of position measurements of fall armyworms (Spodoptera frugiperda) obtained in simulated agricultural conditions with varied plant resources and infection status.
comment: 39 pages, 16 figures
☆ PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations
Large Language Models (LLMs) pose a new paradigm of modeling and computation for information tasks. Recommendation systems are a critical application domain poised to benefit significantly from the sequence modeling capabilities and world knowledge inherent in these large models. In this paper, we introduce PLUM, a framework designed to adapt pre-trained LLMs for industry-scale recommendation tasks. PLUM consists of item tokenization using Semantic IDs, continued pre-training (CPT) on domain-specific data, and task-specific fine-tuning for recommendation objectives. For fine-tuning, we focus particularly on generative retrieval, where the model is directly trained to generate Semantic IDs of recommended items based on user context. We conduct comprehensive experiments on large-scale internal video recommendation datasets. Our results demonstrate that PLUM achieves substantial improvements for retrieval compared to a heavily-optimized production model built with large embedding tables. We also present a scaling study for the model's retrieval performance, our learnings about CPT, a few enhancements to Semantic IDs, along with an overview of the training and inference methods that enable launching this framework to billions of users in YouTube.
comment: 11 pages, 6 figures
☆ Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
☆ FedLAM: Low-latency Wireless Federated Learning via Layer-wise Adaptive Modulation
In wireless federated learning (FL), the clients need to transmit the high-dimensional deep neural network (DNN) parameters through bandwidth-limited channels, which causes the communication latency issue. In this paper, we propose a layer-wise adaptive modulation scheme to save the communication latency. Unlike existing works which assign the same modulation level for all DNN layers, we consider the layers' importance which provides more freedom to save the latency. The proposed scheme can automatically decide the optimal modulation levels for different DNN layers. Experimental results show that the proposed scheme can save up to 73.9% of communication latency compared with the existing schemes.
☆ A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization
In online advertising, heterogeneous advertiser requirements give rise to numerous customized bidding tasks that are typically optimized independently, resulting in extensive computation and limited data efficiency. Multi-task learning offers a principled framework to train these tasks jointly through shared representations. However, existing multi-task optimization strategies are primarily guided by training dynamics and often generalize poorly in volatile bidding environments. To this end, we present Validation-Aligned Multi-task Optimization (VAMO), which adaptively assigns task weights based on the alignment between per-task training gradients and a held-out validation gradient, thereby steering updates toward validation improvement and better matching deployment objectives. We further equip the framework with a periodicity-aware temporal module and couple it with an advanced generative auto-bidding backbone to enhance cross-task transfer of seasonal structure and strengthen bidding performance. Meanwhile, we provide theoretical insights into the proposed method, e.g., convergence guarantee and alignment analysis. Extensive experiments on both simulated and large-scale real-world advertising systems consistently demonstrate significant improvements over typical baselines, illuminating the effectiveness of the proposed approach.
☆ Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization
Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better generalization might be. Unfortunately, the correlation between many existing sharpness measures and the generalization is usually not strong, sometimes even weak. To close the gap between the intuition and the reality, we propose a novel sharpness measure, i.e., \textit{R\'enyi sharpness}, which is defined as the negative R\'enyi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that \textit{uniform} (identical) eigenvalues of the loss Hessian is most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the \textit{R\'enyi entropy} to concisely characterize the extent of the spread of the eigenvalues of loss Hessian. Normally, the larger the spread, the smaller the (R\'enyi) entropy. To rigorously establish the relationship between generalization and (R\'enyi) sharpness, we provide several generalization bounds in terms of R\'enyi sharpness, by taking advantage of the reparametrization invariance property of R\'enyi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (in specific, Kendall rank correlation) between the R\'enyi sharpness and generalization. Moreover, we propose to use a variant of R\'enyi Sharpness as regularizer during training, i.e., R\'enyi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worthy noting that the test accuracy gain of our proposed RSAM method could be as high as nearly 2.5\%, compared against the classical SAM method.
☆ FedBook: A Unified Federated Graph Foundation Codebook with Intra-domain and Inter-domain Knowledge Modeling
Foundation models have shown remarkable cross-domain generalization in language and vision, inspiring the development of graph foundation models (GFMs). However, existing GFMs typically assume centralized access to multi-domain graphs, which is often infeasible due to privacy and institutional constraints. Federated Graph Foundation Models (FedGFMs) address this limitation, but their effectiveness fundamentally hinges on constructing a robust global codebook that achieves intra-domain coherence by consolidating mutually reinforcing semantics within each domain, while also maintaining inter-domain diversity by retaining heterogeneous knowledge across domains. To this end, we propose FedBook, a unified federated graph foundation codebook that systematically aggregates clients' local codebooks during server-side federated pre-training. FedBook follows a two-phase process: (1) Intra-domain Collaboration, where low-frequency tokens are refined by referencing more semantically reliable high-frequency tokens across clients to enhance domain-specific coherence; and (2) Inter-domain Integration, where client contributions are weighted by the semantic distinctiveness of their codebooks during the aggregation of the global GFM, thereby preserving cross-domain diversity. Extensive experiments on 8 benchmarks across multiple domains and tasks demonstrate that FedBook consistently outperforms 21 baselines, including isolated supervised learning, FL/FGL, federated adaptations of centralized GFMs, and FedGFM techniques.
comment: Under Review
☆ When Robustness Meets Conservativeness: Conformalized Uncertainty Calibration for Balanced Decision Making
Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet their effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage-regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost-risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance than existing approaches. These results offer the first principled data-driven methodology for guiding robustness selection and empower practitioners to balance robustness and conservativeness in high-stakes decision-making.
☆ t-SNE Exaggerates Clusters, Provably
Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. To the contrary, we prove that (1) the strength of the input clustering, and (2) the extremity of outlier points, cannot be reliably inferred from the t-SNE output. We demonstrate the prevalence of these failure modes in practice as well.
☆ Parallel Test-Time Scaling for Latent Reasoning Models
Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. \ This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.
☆ MeSH: Memory-as-State-Highways for Recursive Transformers
Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address the issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperforms its larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.
☆ ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs
Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations:(1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples(those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations;(2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01).Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
☆ GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation
Location-Based Social Network (LBSN) check-in trajectory data are important for many practical applications, like POI recommendation, advertising, and pandemic intervention. However, the high collection costs and ever-increasing privacy concerns prevent us from accessing large-scale LBSN trajectory data. The recent advances in synthetic data generation provide us with a new opportunity to achieve this, which utilizes generative AI to generate synthetic data that preserves the characteristics of real data while ensuring privacy protection. However, generating synthetic LBSN check-in trajectories remains challenging due to their spatially discrete, temporally irregular nature and the complex spatio-temporal patterns caused by sparse activities and uncertain human mobility. To address this challenge, we propose GeoGen, a two-stage coarse-to-fine framework for large-scale LBSN check-in trajectory generation. In the first stage, we reconstruct spatially continuous, temporally regular latent movement sequences from the original LBSN check-in trajectories and then design a Sparsity-aware Spatio-temporal Diffusion model (S$^2$TDiff) with an efficient denosing network to learn their underlying behavioral patterns. In the second stage, we design Coarse2FineNet, a Transformer-based Seq2Seq architecture equipped with a dynamic context fusion mechanism in the encoder and a multi-task hybrid-head decoder, which generates fine-grained LBSN trajectories based on coarse-grained latent movement sequences by modeling semantic relevance and behavioral uncertainty. Extensive experiments on four real-world datasets show that GeoGen excels state-of-the-art models for both fidelity and utility evaluation, e.g., it increases over 69% and 55% in distance and radius metrics on the FS-TKY dataset.
☆ DEAS: DEtached value learning with Action Sequence for Scalable Offline RL
Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions and can be interpreted through the options framework via semi-Markov decision process Q-learning, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high return in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
comment: Project website: https://changyeon.site/deas
☆ Computationally-efficient Graph Modeling with Refined Graph Random Features
We propose refined GRFs (GRFs++), a new class of Graph Random Features (GRFs) for efficient and accurate computations involving kernels defined on the nodes of a graph. GRFs++ resolve some of the long-standing limitations of regular GRFs, including difficulty modeling relationships between more distant nodes. They reduce dependence on sampling long graph random walks via a novel walk-stitching technique, concatenating several shorter walks without breaking unbiasedness. By applying these techniques, GRFs++ inherit the approximation quality provided by longer walks but with greater efficiency, trading sequential, inefficient sampling of a long walk for parallel computation of short walks and matrix-matrix multiplication. Furthermore, GRFs++ extend the simplistic GRFs walk termination mechanism (Bernoulli schemes with fixed halting probabilities) to a broader class of strategies, applying general distributions on the walks' lengths. This improves the approximation accuracy of graph kernels, without incurring extra computational cost. We provide empirical evaluations to showcase all our claims and complement our results with theoretical analysis.
comment: Preprint. Comments welcome
☆ Causality Guided Representation Learning for Cross-Style Hate Speech Detection
The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language -- making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.
☆ Large Language Models Meet Virtual Cell: A Survey
Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells"--computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks--cellular representation, perturbation prediction, and gene regulation inference--and review their associated models, datasets, evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.
☆ LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning
In AI-powered e-commerce livestreaming, digital avatars require real-time responses to drive engagement, a task for which high-latency Large Reasoning Models (LRMs) are ill-suited. We introduce LiveThinking, a practical two-stage optimization framework to bridge this gap. First, we address computational cost by distilling a 670B teacher LRM into a lightweight 30B Mixture-of-Experts (MoE) model (3B active) using Rejection Sampling Fine-Tuning (RFT). This reduces deployment overhead but preserves the teacher's verbose reasoning, causing latency. To solve this, our second stage employs reinforcement learning with Group Relative Policy Optimization (GRPO) to compress the model's reasoning path, guided by a multi-objective reward function balancing correctness, helpfulness, and brevity. LiveThinking achieves a 30-fold reduction in computational cost, enabling sub-second latency. In real-world application on Taobao Live, it improved response correctness by 3.3% and helpfulness by 21.8%. Tested by hundreds of thousands of viewers, our system led to a statistically significant increase in Gross Merchandise Volume (GMV), demonstrating its effectiveness in enhancing user experience and commercial performance in live, interactive settings.
comment: 12 pages, 8 figures
☆ FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning NeurIPS 2025
Federated learning (FL) enables collaborative model training across multiple parties without sharing raw data, with semi-asynchronous FL (SAFL) emerging as a balanced approach between synchronous and asynchronous FL. However, SAFL faces significant challenges in optimizing both gradient-based (e.g., FedSGD) and model-based (e.g., FedAvg) aggregation strategies, which exhibit distinct trade-offs in accuracy, convergence speed, and stability. While gradient aggregation achieves faster convergence and higher accuracy, it suffers from pronounced fluctuations, whereas model aggregation offers greater stability but slower convergence and suboptimal accuracy. This paper presents FedQS, the first framework to theoretically analyze and address these disparities in SAFL. FedQS introduces a divide-and-conquer strategy to handle client heterogeneity by classifying clients into four distinct types and adaptively optimizing their local training based on data distribution characteristics and available computational resources. Extensive experiments on computer vision, natural language processing, and real-world tasks demonstrate that FedQS achieves the highest accuracy, attains the lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines. Our work bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning. The code and datasets are available at https://anonymous.4open.science/r/FedQS-EDD6.
comment: Accepted by NeurIPS 2025
☆ Incremental Hybrid Ensemble with Graph Attention and Frequency-Domain Features for Stable Long-Term Credit Risk Modeling
Predicting long-term loan defaults is hard because borrower behavior often changes and data distributions shift over time. This paper presents HYDRA-EI, a hybrid ensemble incremental learning framework. It uses several stages of feature processing and combines multiple models. The framework builds relational, cross, and frequency-based features. It uses graph attention, automatic cross-feature creation, and transformations from the frequency domain. HYDRA-EI updates weekly using new data and adjusts the model weights with a simple performance-based method. It works without frequent manual changes or fixed retraining. HYDRA-EI improves model stability and generalization, which makes it useful for long-term credit risk tasks.
☆ Value Flows
While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows
☆ A Honest Cross-Validation Estimator for Prediction Performance
Cross-validation is a standard tool for obtaining a honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, we propose a new method to estimate the performance of a model trained on a specific (random) training set. A naive estimator can be obtained by applying the model to a disjoint testing set. Surprisingly, cross-validation estimators computed from other random splits can be used to improve this naive estimator within a random-effects model framework. We develop two estimators -- a hierarchical Bayesian estimator and an empirical Bayes estimator -- that perform similarly to or better than both the conventional cross-validation estimator and the naive single-split estimator. Simulations and a real-data example demonstrate the superior performance of the proposed method.
☆ Continual Learning for Adaptive AI Systems
Continual learning the ability of a neural network to learn multiple sequential tasks without losing previously acquired knowledge remains a significant obstacle to developing truly adaptive artificial intelligence. Deep learning models have achieved remarkable results in various applications, but overfitting remains a common issue. Regularization techniques can help prevent overfitting by adding constraints to the model's parameters. To prevent catastrophic forgetting, in this paper we introduce a novel regularization technique based on inter-cluster separation (ICS) in the loss function, which penalizes the model for producing outputs that are far away from the centroids of the clusters formed by the data from previous tasks. We also performed hyperparameter tuning to find the optimal weighting of the proposed regularization term. This ensures clearer separation between tasks in the neural network's internal representation, reducing overlap and mitigating forgetting. Using the standard 5-task Split CIFAR-10 benchmark and a ResNet-18 architecture, we demonstrate ICS's effectiveness in maintaining strong performance on initial tasks. However, our results also highlight limitations in long-term knowledge retention, particularly when the number of tasks increases. This underscores the complexity and trade-offs inherent in continual learning and points toward avenues for further research.
comment: 5 pages 2 figures 2 tables
☆ Design-Based Bandits Under Network Interference: Trade-Off Between Regret and Statistical Inference
In multi-armed bandits with network interference (MABNI), the action taken by one node can influence the rewards of others, creating complex interdependence. While existing research on MABNI largely concentrates on minimizing regret, it often overlooks the crucial concern that an excessive emphasis on the optimal arm can undermine the inference accuracy for sub-optimal arms. Although initial efforts have been made to address this trade-off in single-unit scenarios, these challenges have become more pronounced in the context of MABNI. In this paper, we establish, for the first time, a theoretical Pareto frontier characterizing the trade-off between regret minimization and inference accuracy in adversarial (design-based) MABNI. We further introduce an anytime-valid asymptotic confidence sequence along with a corresponding algorithm, $\texttt{EXP3-N-CS}$, specifically designed to balance the trade-off between regret minimization and inference accuracy in this setting.
☆ Property Classification of Vacation Rental Properties during Covid-19
This study advocates for employing clustering techniques to classify vacation rental properties active during the Covid pandemic to identify inherent patterns and behaviours. The dataset, a collaboration between the ESRC funded Consumer Data Research Centre (CDRC) and AirDNA, encompasses data for over a million properties and hosts. Utilising K-means and K-medoids clustering techniques, we identify homogenous groups and their common characteristics. Our findings enhance comprehension of the intricacies of vacation rental evaluations and could potentially be utilised in the creation of targeted, cluster-specific policies.
comment: GISRUK 2024 Poster
☆ Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
☆ DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
Existing batch size selection approaches in distributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequential decision-making problem using Proximal Policy Optimization (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse workloads, hardware configurations, and network conditions, DYNAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.
☆ The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or "awareness" regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.
☆ Team Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception -- Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track
In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent's ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.
☆ Multi-fidelity Batch Active Learning for Gaussian Process Classifiers
Many science and engineering problems rely on expensive computational simulations, where a multi-fidelity approach can accelerate the exploration of a parameter space. We study efficient allocation of a simulation budget using a Gaussian Process (GP) model in the binary simulation output case. This paper introduces Bernoulli Parameter Mutual Information (BPMI), a batch active learning algorithm for multi-fidelity GP classifiers. BPMI circumvents the intractability of calculating mutual information in the probability space by employing a first-order Taylor expansion of the link function. We evaluate BPMI against several baselines on two synthetic test cases and a complex, real-world application involving the simulation of a laser-ignited rocket combustor. In all experiments, BPMI demonstrates superior performance, achieving higher predictive accuracy for a fixed computational budget.
☆ Sparse components distinguish visual pathways & their alignment to neural networks
The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams -- identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system's underlying axes of neural tuning.
☆ Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training ICLR 2025
Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
comment: First submitted on February 10th, 2025 to ICLR 2025 Workshop (XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge). The paper was accepted but the workshop does not generate proceedings. Now uploading to arXiv to make the paper publicly available
☆ On the Alignment Between Supervised and Self-Supervised Contrastive Learning
Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: {\em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?} We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at [\href{https://github.com/DLFundamentals/understanding_ssl_v2}{code}, \href{https://dlfundamentals.github.io/cl-nscl-representation-alignment/}{project page}].
☆ The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective NeurIPS 2025
As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlooked prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
comment: Accepted at NeurIPS 2025
☆ Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction
Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
☆ Long-Tailed Recognition via Information-Preservable Two-Stage Learning NeurIPS 2025
The imbalance (or long-tail) is the nature of many real-world data distributions, which often induces the undesirable bias of deep classification models toward frequent classes, resulting in poor performance for tail classes. In this paper, we propose a novel two-stage learning approach to mitigate such a majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from the information theory perspective. This approach is theoretically equivalent to minimizing intra-class distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, able to rectify majority-biased decision boundaries without compromising a model's overall performance. As a result, our approach achieves the state-of-the-art performance across various long-tailed benchmark datasets, validated via extensive experiments. Our code is available at https://github.com/fudong03/BNS_IPDPP.
comment: Accepted by NeurIPS 2025 as Spotlight
☆ CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.
☆ TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference
Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from textual graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
☆ Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation
From loco-motion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are a few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks across 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.
☆ Man-Made Heuristics Are Dead. Long Live Code Generators!
Policy design for various systems controllers has conventionally been a manual process, with domain experts carefully tailoring heuristics for the specific instance in which the policy will be deployed. In this paper, we re-imagine policy design via a novel automated search technique fueled by recent advances in generative models, specifically Large Language Model (LLM)-driven code generation. We outline the design and implementation of PolicySmith, a framework that applies LLMs to synthesize instance-optimal heuristics. We apply PolicySmith to two long-standing systems policies - web caching and congestion control, highlighting the opportunities unraveled by this LLM-driven heuristic search. For caching, PolicySmith discovers heuristics that outperform established baselines on standard open-source traces. For congestion control, we show that PolicySmith can generate safe policies that integrate directly into the Linux kernel.
comment: 10 pages, 2 figures, 2 tables. To be presented at HotNets 2025
☆ Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops ICML
Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems
comment: 6 Pages, 6 Figures, 3 Tables, Accepted as a Regular Research paper at ICMLA 2025
☆ SkipSR: Faster Super Resolution with Token Skipping
Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
comment: 14 pages, 7 figures
☆ TAPAS: Datasets for Learning the Learning with Errors Problem
AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform "classical" attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners' ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.
☆ PO-CKAN:Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure
We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov--Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov--Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch--trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48\% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion--reaction benchmarks.
☆ Deceptive Exploration in Multi-armed Bandits
We consider a multi-armed bandit setting in which each arm has a public and a private reward distribution. An observer expects an agent to follow Thompson Sampling according to the public rewards, however, the deceptive agent aims to quickly identify the best private arm without being noticed. The observer can observe the public rewards and the pulled arms, but not the private rewards. The agent, on the other hand, observes both the public and private rewards. We formalize detectability as a stepwise Kullback-Leibler (KL) divergence constraint between the actual pull probabilities used by the agent and the anticipated pull probabilities by the observer. We model successful pulling of public suboptimal arms as a % Bernoulli process where the success probability decreases with each successful pull, and show these pulls can happen at most at a $\Theta(\sqrt{T}) $ rate under the KL constraint. We then formulate a maximin problem based on public and private means, whose solution characterizes the optimal error exponent for best private arm identification. We finally propose an algorithm inspired by top-two algorithms. This algorithm naturally adapts its exploration according to the hardness of pulling arms based on the public suboptimality gaps. We provide numerical examples illustrating the $\Theta(\sqrt{T}) $ rate and the behavior of the proposed algorithm.
☆ Weights initialization of neural networks for function approximation
Neural network-based function approximation plays a pivotal role in the advancement of scientific computing and machine learning. Yet, training such models faces several challenges: (i) each target function often requires training a new model from scratch; (ii) performance is highly sensitive to architectural and hyperparameter choices; and (iii) models frequently generalize poorly beyond the training domain. To overcome these challenges, we propose a reusable initialization framework based on basis function pretraining. In this approach, basis neural networks are first trained to approximate families of polynomials on a reference domain. Their learned parameters are then used to initialize networks for more complex target functions. To enhance adaptability across arbitrary domains, we further introduce a domain mapping mechanism that transforms inputs into the reference domain, thereby preserving structural correspondence with the pretrained models. Extensive numerical experiments in one- and two-dimensional settings demonstrate substantial improvements in training efficiency, generalization, and model transferability, highlighting the promise of initialization-based strategies for scalable and modular neural function approximation. The full code is made publicly available on Gitee.
comment: 19 pages, 10 figures
☆ Guiding Exploration in Reinforcement Learning Through LLM-Augmented Observations
Reinforcement Learning (RL) agents often struggle in sparse-reward environments where traditional exploration strategies fail to discover effective action sequences. Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining that could guide RL exploration, but existing approaches create rigid dependencies where RL policies must follow LLM suggestions or incorporate them directly into reward functions. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore this guidance. Our method leverages LLMs' world knowledge and reasoning abilities while maintaining flexibility through soft constraints. We evaluate our approach on three BabyAI environments of increasing complexity and show that the benefits of LLM guidance scale with task difficulty. In the most challenging environment, we achieve 71% relative improvement in final success rates over baseline. The approach provides substantial sample efficiency gains, with agents reaching performance thresholds up to 9 times faster, and requires no modifications to existing RL algorithms. Our results demonstrate an effective method for leveraging LLM planning capabilities to accelerate RL training in challenging environments.
comment: Accepted to LM4Plan Workshop @ ICAPS 2025 (withdrawn before presentation due to lack of travel funding)
☆ Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings
Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM's internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
☆ Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform
This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.
comment: 6 pages
☆ Prioritizing Latency with Profit: A DRL-Based Admission Control for 5G Network Slices
5G networks enable diverse services such as eMBB, URLLC, and mMTC through network slicing, necessitating intelligent admission control and resource allocation to meet stringent QoS requirements while maximizing Network Service Provider (NSP) profits. However, existing Deep Reinforcement Learning (DRL) frameworks focus primarily on profit optimization without explicitly accounting for service delay, potentially leading to QoS violations for latency-sensitive slices. Moreover, commonly used epsilon-greedy exploration of DRL often results in unstable convergence and suboptimal policy learning. To address these gaps, we propose DePSAC -- a Delay and Profit-aware Slice Admission Control scheme. Our DRL-based approach incorporates a delay-aware reward function, where penalties due to service delay incentivize the prioritization of latency-critical slices such as URLLC. Additionally, we employ Boltzmann exploration to achieve smoother and faster convergence. We implement and evaluate DePSAC on a simulated 5G core network substrate with realistic Network Slice Request (NSLR) arrival patterns. Experimental results demonstrate that our method outperforms the DSARA baseline in terms of overall profit, reduced URLLC slice delays, improved acceptance rates, and improved resource consumption. These findings validate the effectiveness of the proposed DePSAC in achieving better QoS-profit trade-offs for practical 5G network slicing scenarios.
comment: 6 pages, 8 figures, Accepted for publication at IEEE ANTS 2025
☆ Zero-Shot Policy Transfer in Reinforcement Learning using Buckingham's Pi Theorem
Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham's Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.
♻ ☆ On the Convergence of Moral Self-Correction in Large Language Models
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
comment: 19pages, 7 figures
♻ ☆ Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging
We introduce Bridged Clustering, a semi-supervised framework to learn predictors from any unpaired input $X$ and output $Y$ dataset. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.
♻ ☆ Want to train KANS at scale? Now UKAN!
Kolmogorov-Arnold Networks (KANs) have recently emerged as a powerful alternative to traditional multilayer perceptrons. However, their reliance on predefined, bounded grids restricts their ability to approximate functions on unbounded domains. To address this, we present Unbounded Kolmogorov-Arnold Networks (UKANs), a method that removes the need for bounded grids in traditional Kolmogorov-Arnold Networks (KANs). The key innovation of this method is a coefficient-generator (CG) model that produces, on the fly, only the B-spline coefficients required locally on an unbounded symmetric grid. UKANs couple multilayer perceptrons with KANs by feeding the positional encoding of grid groups into the CG model, enabling function approximation on unbounded domains without requiring data normalization. To reduce the computational cost of both UKANs and KANs, we introduce a GPU-accelerated library that lowers B-spline evaluation complexity by a factor proportional to the grid size, enabling large-scale learning by leveraging efficient memory management, in line with recent software advances such as FlashAttention and FlashFFTConv. Performance benchmarking confirms the superior memory and computational efficiency of our accelerated KAN (warpKAN), and UKANs, showing a 3-30x speed-up and up to 1000x memory reduction compared to vanilla KANs. Experiments on regression, classification, and generative tasks demonstrate the effectiveness of UKANs to match or surpass KAN accuracy. Finally, we use both accelerated KAN and UKAN in a molecular property prediction task, establishing the feasibility of large-scale end-to-end training with our optimized implementation.
comment: 16 pages, 5 figures, 8 tables
♻ ☆ SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the \textit{Reasoning Tax}. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance $70.13$ and $78.97$ on safety and helpfulness across six benchmarks, surpassing both same-scale and $>10\times$ larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by \num{6.47} and \num{16.76} points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.
♻ ☆ Is Supervised Learning Really That Different from Unsupervised?
We demonstrate how supervised learning can be decomposed into a two-stage procedure, where (1) all model parameters are selected in an unsupervised manner, and (2) the outputs y are added to the model, without changing the parameter values. This is achieved by a new model selection criterion that, in contrast to cross-validation, can be used also without access to y. For linear ridge regression, we bound the asymptotic out-of-sample risk of our method in terms of the optimal asymptotic risk. We also demonstrate on real and synthetic data that versions of linear and kernel ridge regression, smoothing splines, and neural networks, which are trained without access to y, perform similarly to their standard y-based counterparts. Hence, our results suggest that the difference between supervised and unsupervised learning is less fundamental than it may appear.
♻ ☆ IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs
Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR\, -- \,a quality-constrained \textbf{I}ntelligent \textbf{P}rompt \textbf{R}outing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench\footnote{IPRBench will be released upon legal approval.}, a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9\% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency. The deployed system and additional product details are publicly available at https://aws.amazon.com/bedrock/intelligent-prompt-routing/
♻ ☆ Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback
We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents' context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online notes generation and regularly inform offline model retraining, closing the agent edits feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.
comment: Accepted at EMNLP 2025 Industry Track
♻ ☆ XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation
Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and relying heavily on sparse rewards. This paper presents XRPO(eXplore - eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advances (e.g., GRPO and GSPO) up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7X.
♻ ☆ DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization
Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.
comment: Accepted to the NewSumm Workshop at EMNLP 2025. Equal contribution from the first four authors
♻ ☆ Graph-SCP: Accelerating Set Cover Problems with Graph Neural Networks
Machine learning (ML) approaches are increasingly being used to accelerate combinatorial optimization (CO) problems. We investigate the Set Cover Problem (SCP) and propose Graph-SCP, a graph neural network method that augments existing optimization solvers by learning to identify a smaller sub-problem that contains the solution space. Graph-SCP uses both supervised learning from prior solved instances and unsupervised learning to minimize the SCP objective. We evaluate the performance of Graph-SCP on synthetically weighted and unweighted SCP instances with diverse problem characteristics and complexities, and on instances from the OR Library, a canonical benchmark for SCP. We show that Graph-SCP reduces the problem size by 60-80% and achieves runtime speedups of up to 10x on average when compared to Gurobi (a state-of-the-art commercial solver), while maintaining solution quality. This is in contrast to fast greedy solutions that significantly compromise solution quality to achieve guaranteed polynomial runtime. We showcase Graph-SCP's ability to generalize to larger problem sizes, training on SCP instances with up to 3,000 subsets and testing on SCP instances with up to 10,000 subsets.
♻ ☆ Understanding In-context Learning of Addition via Activation Subspaces
To perform few-shot learning, language models extract signals from a few input-label pairs, aggregate these into a learned prediction rule, and apply this rule to new inputs. How is this implemented in the forward pass of modern transformer models? To explore this question, we study a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We introduce a novel optimization method that localizes the model's few-shot ability to only a few attention heads. We then perform an in-depth analysis of individual heads, via dimensionality reduction and decomposition. As an example, on Llama-3-8B-instruct, we reduce its mechanism on our tasks to just three attention heads with six-dimensional subspaces, where four dimensions track the unit digit with trigonometric functions at periods $2$, $5$, and $10$, and two dimensions track magnitude with low-frequency components. To deepen our understanding of the mechanism, we also derive a mathematical identity relating ``aggregation'' and ``extraction'' subspaces for attention heads, allowing us to track the flow of information from individual examples to a final aggregated concept. Using this, we identify a self-correction mechanism where mistakes learned from earlier demonstrations are suppressed by later demonstrations. Our results demonstrate how tracking low-dimensional subspaces of localized heads across a forward pass can provide insight into fine-grained computational structures in language models.
♻ ☆ From Moments to Models: Graphon Mixture-Aware Mixup and Contrastive Learning
Real-world graph datasets often consist of mixtures of populations, where graphs are generated from multiple distinct underlying distributions. However, modern representation learning approaches, such as graph contrastive learning (GCL) and augmentation methods like Mixup, typically overlook this mixture structure. In this work, we propose a unified framework that explicitly models data as a mixture of underlying probabilistic graph generative models represented by graphons. To characterize these graphons, we leverage graph moments (motif densities) to cluster graphs arising from the same model. This enables us to disentangle the mixture components and identify their distinct generative mechanisms. This model-aware partitioning benefits two key graph learning tasks: 1) It enables a graphon-mixture-aware mixup (GMAM), a data augmentation technique that interpolates in a semantically valid space guided by the estimated graphons, instead of assuming a single graphon per class. 2) For GCL, it enables model-adaptive and principled augmentations. Additionally, by introducing a new model-aware objective, our proposed approach (termed MGCL) improves negative sampling by restricting negatives to graphs from other models. We establish a key theoretical guarantee: a novel, tighter bound showing that graphs sampled from graphons with small cut distance will have similar motif densities with high probability. Extensive experiments on benchmark datasets demonstrate strong empirical performance. In unsupervised learning, MGCL achieves state-of-the-art results, obtaining the top average rank across eight datasets. In supervised learning, GMAM consistently outperforms existing strategies, achieving new state-of-the-art accuracy in 6 out of 7 datasets.
♻ ☆ Feature Identification via the Empirical NTK
We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. Across two standard toy models for mechanistic interpretability, Toy Models of Superposition (TMS) and a 1-layer MLP trained on modular addition, we find that the eNTK exhibits sharp spectral cliffs whose top eigenspaces align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers and that the evolution of the eNTK spectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.
comment: 14 pages, 5 figures. v2: references and expanded discussion in Appendix B added
♻ ☆ Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
comment: dataset link: https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
♻ ☆ Training a Foundation Model for Materials on a Budget NeurIPS 2025
Foundation models for materials modeling are advancing quickly, but their training remains expensive, often placing state-of-the-art methods out of reach for many research groups. We introduce Nequix, a compact E(3)-equivariant potential that pairs a simplified NequIP design with modern training practices, including equivariant root-mean-square layer normalization and the Muon optimizer, to retain accuracy while substantially reducing compute requirements. Nequix has 700K parameters and was trained in 100 A100 GPU-hours. On the Matbench-Discovery and MDR Phonon benchmarks, Nequix ranks third overall while requiring a 20 times lower training cost than most other methods, and it delivers two orders of magnitude faster inference speed than the current top-ranked model. We release model weights and fully reproducible codebase at https://github.com/atomicarchitects/nequix.
comment: To appear at the NeurIPS 2025 Workshop on AI for Accelerated Materials Discovery
♻ ☆ Efficient Graph Condensation via Gaussian Process
Graph condensation reduces the size of large graphs while preserving performance, addressing the scalability challenges of Graph Neural Networks caused by computational inefficiencies on large datasets. Existing methods often rely on bi-level optimization, requiring extensive GNN training and limiting their scalability. To address these issues, this paper proposes Graph Condensation via Gaussian Process (GCGP), a novel and computationally efficient approach to graph condensation. GCGP utilizes a Gaussian Process (GP), with the condensed graph serving as observations, to estimate the posterior distribution of predictions. This approach eliminates the need for the iterative and resource-intensive training typically required by GNNs. To enhance the capability of the GCGP in capturing dependencies between function values, we derive a specialized covariance function that incorporates structural information. This covariance function broadens the receptive field of input nodes by local neighborhood aggregation, thereby facilitating the representation of intricate dependencies within the nodes. To address the challenge of optimizing binary structural information in condensed graphs, Concrete random variables are utilized to approximate the binary adjacency matrix in a continuous counterpart. This relaxation process allows the adjacency matrix to be represented in a differentiable form, enabling the application of gradient-based optimization techniques to discrete graph structures. Experimental results show that the proposed GCGP method efficiently condenses large-scale graph data while preserving predictive performance, addressing the scalability and efficiency challenges. The implementation of our method is publicly available at https://github.com/WANGLin0126/GCGP.
♻ ☆ Multi-Turn Human-LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol NeurIPS 2025
Our interest is in the design of software systems involving a human-expert interacting -- using natural language -- with a large language model (LLM) on data analysis tasks. For complex problems, it is possible that LLMs can harness human expertise and creativity to find solutions that were otherwise elusive. On one level, this interaction takes place through multiple turns of prompts from the human and responses from the LLM. Here we investigate a more structured approach based on an abstract protocol described in [3] for interaction between agents. The protocol is motivated by a notion of "two-way intelligibility" and is modelled by a pair of communicating finite-state machines. We provide an implementation of the protocol, and provide empirical evidence of using the implementation to mediate interactions between an LLM and a human-agent in two areas of scientific interest (radiology and drug design). We conduct controlled experiments with a human proxy (a database), and uncontrolled experiments with human subjects. The results provide evidence in support of the protocol's capability of capturing one- and two-way intelligibility in human-LLM interaction; and for the utility of two-way intelligibility in the design of human-machine systems. Our code is available at https://github.com/karannb/interact.
comment: Multi-Turn Interactions in Large Language Models (MTI-LLM) Workshop at NeurIPS 2025
♻ ☆ Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection
Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.
comment: Accepted at EMNLP 2025 Findings (Short)
♻ ☆ Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.
comment: Original version uploaded on Sep 22, 2025. (v2): Extended Table 2 with additional analysis and referenced it in Sec 5.2
♻ ☆ Language Model Embeddings Can Be Sufficient for Bayesian Optimization
Bayesian Optimization is ubiquitous in experimental design and black-box optimization for improving search efficiency. However, most existing approaches rely on regression models which are limited to fixed search spaces and structured, tabular input features. This paper explores the use of LLM embeddings over string inputs for in-context regression in Bayesian Optimization. Our results show that representing inputs as strings enables general-purpose regression across diverse domains, including synthetic, combinatorial, and hyperparameter optimization. Furthermore, our approach achieves optimization performance comparable to state-of-the-art Gaussian Process-based methods such as Google Vizier, and demonstrates potential for broader and more flexible applications.
comment: Code can be found in https://github.com/google-research/optformer/tree/main/optformer/embed_then_regress
♻ ☆ More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.
comment: 20 pages, 5 figures
♻ ☆ Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
Downstream scaling laws aim to predict task performance at larger scales from the model's performance at smaller scales. Whether such prediction should be possible is unclear: some works discover clear linear scaling trends after simple transformations of the performance metric, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, and we find that predictable scaling only occurs in a minority of cases: 39% of the time. Moreover, seemingly benign changes to the experimental setting can completely change the scaling behavior. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To accurately model the relationship between pretraining loss and task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
comment: EMNLP Findings 2025 Camera Ready
♻ ☆ LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization
Bayesian optimization (BO) is a sequential decision-making tool widely used for optimizing expensive black-box functions. Recently, Large Language Models (LLMs) have shown remarkable adaptability in low-data regimes, making them promising tools for black-box optimization by leveraging contextual knowledge to propose high-quality query points. However, relying solely on LLMs as optimization agents introduces risks due to their lack of explicit surrogate modeling and calibrated uncertainty, as well as their inherently opaque internal mechanisms. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off, ultimately undermining theoretical tractability and reliability. To address this, we propose LLINBO: LLM-in-the-Loop BO, a hybrid framework for BO that combines LLMs with statistical surrogate experts (e.g., Gaussian Processes (GP)). The core philosophy is to leverage contextual reasoning strengths of LLMs for early exploration, while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We end the paper with a real-life proof-of-concept in the context of 3D printing. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/LLM-in-the-Loop-BO.
♻ ☆ A Survey of Reinforcement Learning for Large Reasoning Models
In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
comment: Fixed typos; added missing and recent citations (117 -> 120 pages)
♻ ☆ Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
comment: 21 pages
♻ ☆ Data-Error Scaling Laws in Machine Learning on Combinatorial Mutation-prone Sets: Proteins and Small Molecules
We investigate trends in the data-error scaling laws of machine learning (ML) models trained on discrete combinatorial spaces that are prone-to-mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computational and experimental training data. Our synthetic datasets comprised i) two na\"ive functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy atom structural graphs, while the experimental dataset consisted of a full deep mutational scan of the binding protein GB1. In contrast to typical data-error scaling laws, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. number of mutations) enclosed in the training set. We show that during training on this class of problems, the predictions were clustered by the ML models employed in the calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and introduce the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces such as chemical properties or protein phenotype prediction, and improves basic understanding of concepts in statistical learning theory.
♻ ☆ Machine-Learning Driven Load Shedding to Mitigate Instability Attacks in Power Grids
Critical infrastructures are becoming increasingly complex as our society becomes increasingly dependent on them. This complexity opens the door to new possibilities for attacks and a need for new defense strategies. Our work focuses on instability attacks on the power grid, wherein an attacker causes cascading outages by introducing unstable dynamics into the system. When stress is place on the power grid, a standard mitigation approach is load-shedding: the system operator chooses a set of loads to shut off until the situation is resolved. While this technique is standard, there is no systematic approach to choosing which loads will stop an instability attack. This paper addresses this problem using a data-driven methodology for load shedding decisions. We show a proof of concept on the IEEE 14 Bus System using the Achilles Heel Technologies Power Grid Analyzer, and show through an implementation of modified Prony analysis (MPA) that MPA is a viable method for detecting instability attacks and triggering defense mechanisms.
♻ ☆ Cost-aware Stopping for Bayesian Optimization
In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora's Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function usually matches or outperforms other acquisition-function--stopping-rule pairs in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.
♻ ☆ On The Sample Complexity Bounds In Bilevel Reinforcement Learning
Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-{\L}ojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\mathcal{O}(\epsilon^{-3})$ improving upon existing bounds of $\mathcal{O}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.
comment: This is updated version of the paper 2410.15610
♻ ☆ Neuro-Symbolic Agents with Modal Logic for Autonomous Diagnostics
The development of intelligent agents, particularly those powered by language models (LMs), has shown the critical role in various environments that require intelligent and autonomous decision. Environments are not passive testing grounds and they represent the data required for agents to learn and exhibit very challenging conditions that require adaptive, complex and autonomous capacity to make decisions. While the paradigm of scaling models and datasets has led to remarkable emergent capabilities, we argue that scaling the structure, fidelity, and logical consistency of agent reasoning within these environments is a crucial, yet underexplored, dimension of AI research. This paper introduces a neuro-symbolic multi-agent architecture where the belief states of individual agents are formally represented as Kripke models. This foundational choice enables them to reason about known concepts of \emph{possibility} and \emph{necessity} using the formal language of modal logic. In this work, we use of immutable, domain-specific knowledge to make infere information, which is encoded as logical constraints essential for proper diagnosis. In the proposed model, we show constraints that actively guide the hypothesis generation of LMs, effectively preventing them from reaching physically or logically untenable conclusions. In a high-fidelity simulated particle accelerator environment, our system successfully diagnoses complex, cascading failures by combining the powerful semantic intuition of LMs with the rigorous, verifiable validation of modal logic and a factual world model and showcasing a viable path toward more robust, reliable, and verifiable autonomous agents.
comment: 10 pages, 1 figure, Scaling Environments for Agents (SEA) Workshop at NeuralIPS
♻ ☆ Real-time Noise Detection and Classification in Single-Channel EEG: A Lightweight Machine Learning Approach for EMG, White Noise, and EOG Artifacts
Electroencephalogram (EEG) artifact detection in real-world settings faces significant challenges such as computational inefficiency in multi-channel methods, poor robustness to simultaneous noise, and trade-offs between accuracy and complexity in deep learning models. We propose a hybrid spectral-temporal framework for real-time detection and classification of ocular (EOG), muscular (EMG), and white noise artifacts in single-channel EEG. This method, in contrast to other approaches, combines time-domain low-pass filtering (targeting low-frequency EOG) and frequency-domain power spectral density (PSD) analysis (capturing broad-spectrum EMG), followed by PCA-optimized feature fusion to minimize redundancy while preserving discriminative information. This feature engineering strategy allows a lightweight multi-layer perceptron (MLP) architecture to outperform advanced CNNs and RNNs by achieving 99% accuracy at low SNRs (SNR -7) dB and >90% accuracy in moderate noise (SNR 4 dB). Additionally, this framework addresses the unexplored problem of simultaneous multi-source contamination(EMG+EOG+white noise), where it maintains 96% classification accuracy despite overlapping artifacts. With 30-second training times (97% faster than CNNs) and robust performance across SNR levels, this framework bridges the gap between clinical applicability and computational efficiency, which enables real-time use in wearable brain-computer interfaces. This work also challenges the ubiquitous dependence on model depth for EEG artifact detection by demonstrating that domain-informed feature fusion surpasses complex architecture in noisy scenarios.
♻ ☆ AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations
Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models (from 70M to 14B parameters) demonstrate that this complexity-driven adaptation significantly outperforms fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity and interpretability metrics while eliminating the computational burden of extensive hyperparameter tuning.
♻ ☆ Shape-Informed Clustering of Multi-Dimensional Functional Data via Deep Functional Autoencoders
We introduce FAEclust, a novel functional autoencoder framework for cluster analysis of multi-dimensional functional data, data that are random realizations of vector-valued random functions. Our framework features a universal-approximator encoder that captures complex nonlinear interdependencies among component functions, and a universal-approximator decoder capable of accurately reconstructing both Euclidean and manifold-valued functional data. Stability and robustness are enhanced through innovative regularization strategies applied to functional weights and biases. Additionally, we incorporate a clustering loss into the network's training objective, promoting the learning of latent representations that are conducive to effective clustering. A key innovation is our shape-informed clustering objective, ensuring that the clustering results are resistant to phase variations in the functions. We establish the universal approximation property of our non-linear decoder and validate the effectiveness of our model through extensive experiments.
♻ ☆ Uncertainty Comes for Free: Human-in-the-Loop Policies with Diffusion Models
Human-in-the-loop (HitL) robot deployment has gained significant attention in both academia and industry as a semi-autonomous paradigm that enables human operators to intervene and adjust robot behaviors at deployment time, improving success rates. However, continuous human monitoring and intervention can be highly labor-intensive and impractical when deploying a large number of robots. To address this limitation, we propose a method that allows diffusion policies to actively seek human assistance only when necessary, reducing reliance on constant human oversight. To achieve this, we leverage the generative process of diffusion policies to compute an uncertainty-based metric based on which the autonomous agent can decide to request operator assistance at deployment time, without requiring any operator interaction during training. Additionally, we show that the same method can be used for efficient data collection for fine-tuning diffusion policies in order to improve their autonomous performance. Experimental results from simulated and real-world environments demonstrate that our approach enhances policy performance during deployment for a variety of scenarios.
♻ ☆ Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective NeurIPS2024
State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segemenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields compact representation for image embeddings as segmentation masks. Experiments conducted on dataset ADE20K find that DEPICT consistently outperforms its black-box counterpart, Segmenter, and it is light weight and more robust.
comment: NeurIPS2024. Code:https://github.com/QishuaiWen/DEPICT/
♻ ☆ Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
♻ ☆ The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology
Existing interpretability methods for Large Language Models (LLMs) often fall short by focusing on linear directions or isolated features, overlooking the high-dimensional, nonlinear, and relational geometry within model representations. This study focuses on how adversarial inputs systematically affect the internal representation spaces of LLMs, a topic which remains poorly understood. We propose persistent homology (PH), a tool from topological data analysis, as a principled framework to characterize the multi-scale dynamics within LLM activations. Using PH, we systematically analyze six state-of-the-art models under two distinct adversarial conditions, indirect prompt injection and backdoor fine-tuning, and identify a consistent topological signature of adversarial influence. Across architectures and model sizes, adversarial inputs induce ``topological compression'', where the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, and more dispersed large-scale ones. This topological signature is statistically robust across layers, highly discriminative, and provides interpretable insights into how adversarial effects emerge and propagate. By quantifying the shape of activations and neuronal information flow, our architecture-agnostic framework reveals fundamental invariants of representational change, offering a complementary perspective to existing interpretability methods.
♻ ☆ Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation
Modern machine learning (ML) training workloads place substantial demands on both computational and communication resources. Consequently, accurate performance estimation has become increasingly critical for guiding system design decisions, such as the selection of parallelization strategies, cluster configurations, and hardware provisioning. Existing simulation-based performance estimation requires reimplementing the ML framework in a simulator, which demands significant manual effort and is hard to maintain as ML frameworks evolve rapidly. This paper introduces Phantora, a hybrid GPU cluster simulator designed for performance estimation of ML training workloads. Phantora executes unmodified ML frameworks as is within a distributed, containerized environment. Each container emulates the behavior of a GPU server in a large-scale cluster, while Phantora intercepts and simulates GPU- and communication-related operations to provide high-fidelity performance estimation. We call this approach hybrid simulation of ML systems, in contrast to traditional methods that simulate static workloads. The primary advantage of hybrid simulation is that it allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation. Our evaluation shows that Phantora provides accuracy comparable to static workload simulation while supporting three state-of-the-art LLM training frameworks out-of-the-box. In addition, Phantora operates on a single GPU, eliminating the need for the resource-intensive trace collection and workload extraction steps required by traditional trace-based simulators. Phantora is open-sourced at https://github.com/QDelta/Phantora.
♻ ☆ Curl Descent: Non-Gradient Learning Dynamics with Sign-Diverse Plasticity
Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient "curl"-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through neurons exhibiting rule-flipped plasticity. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.
♻ ☆ Interpreting GNN-based IDS Detections Using Provenance Graph Structural Features
Advanced cyber threats (e.g., Fileless Malware and Advanced Persistent Threat (APT)) have driven the adoption of provenance-based security solutions. These solutions employ Machine Learning (ML) models for behavioral modeling and critical security tasks such as malware and anomaly detection. However, the opacity of ML-based security models limits their broader adoption, as the lack of transparency in their decision-making processes restricts explainability and verifiability. We tailored our solution towards Graph Neural Network (GNN)-based security solutions since recent studies employ GNNs to comprehensively digest system provenance graphs for security-critical tasks. To enhance the explainability of GNN-based security models, we introduce PROVEXPLAINER, a framework offering instance-level security-aware explanations using an interpretable surrogate model. PROVEXPLAINER's interpretable feature space consists of discriminant subgraph patterns and graph structural features, which can be directly mapped to the system provenance problem space, making the explanations human interpretable. We show how PROVEXPLAINER synergizes with current state-of-the-art (SOTA) GNN explainers to deliver domain and instance-specific explanations. We measure the explanation quality using the Fidelity+/Fidelity- metric as used by traditional GNN explanation literature, we incorporate the precision/recall metric, where we consider the accuracy of the explanation against the ground truth, and we designed a human actionability metric based on graph traversal distance. On real-world Fileless and APT datasets, PROVEXPLAINER achieves up to 29%/27%/25%/1.4x higher Fidelity+, precision, recall, and actionability (where higher values are better), and 12% lower Fidelity- (where lower values are better) when compared against SOTA GNN explainers.
♻ ☆ Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning
Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice due to the need to learn from datasets with limited coverage of the state-action space and to generalize across long-horizon tasks. To improve on these challenges, we propose a \emph{Physics-informed (Pi)} regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE) and which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Eikonal-regularized HIQL (Eik-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks.
♻ ☆ HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions
We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a "delta" PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.
♻ ☆ Anticipating the Selectivity of Intramolecular Cyclization Reaction Pathways with Neural Network Potentials
Reaction mechanism search tools have demonstrated the ability to provide insights into likely products and rate-limiting steps of reacting systems. However, reactions involving several concerted bond changes - as can be found in many key steps of natural product synthesis - can complicate the search process. To mitigate these complications, we present a mechanism search strategy particularly suited to help expedite exploration of an exemplary family of such complex reactions, cyclizations. We provide a cost-effective strategy for identifying relevant elementary reaction steps by combining graph-based enumeration schemes and machine learning techniques for intermediate filtering. Key to this approach is our use of a neural network potential (NNP), AIMNet2-rxn, for computational evaluation of each candidate reaction pathway. In this article, we evaluate the NNP's ability to estimate activation energies, demonstrate the correct anticipation of stereoselectivity, and recapitulate complex enabling steps in natural product synthesis.
comment: 32 pages, 5 figures
♻ ☆ Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
♻ ☆ InfoPos: A Design Support Framework for ML-Assisted Fault Detection and Identification in Industrial Cyber-Physical Systems
The variety of building blocks and algorithms incorporated in data-centric and ML-assisted fault detection and identification solutions is high, contributing to two challenges: selection of the most effective set and order of building blocks, as well as achieving such a selection with minimum cost. Considering that ML-assisted solution design is influenced by the extent of available data and the extent of available knowledge of the target system, it is advantageous to be able to select effective and matching building blocks. We introduce the first iteration of our InfoPos framework, allowing the placement of fault detection/identification use-cases based on the available levels (positions), i.e., from poor to rich, of knowledge and data dimensions. With that input, designers and developers can reveal the most effective corresponding choice(s), streamlining the solution design process. The results from a demonstrator, a fault identification use-case for industrial Cyber-Physical Systems, reflects achieved effects when different building blocks are used throughout knowledge and data positions. The achieved ML model performance is considered as the indicator for a better solution. The data processing code and composed datasets are publicly available.
♻ ☆ Spatial-Functional awareness Transformer-based graph archetype contrastive learning for Decoding Visual Neural Representations from EEG
Decoding visual neural representations from Electroencephalography (EEG) signals remains a formidable challenge due to their high-dimensional, noisy, and non-Euclidean nature. In this work, we propose a Spatial-Functional Awareness Transformer-based Graph Archetype Contrastive Learning (SFTG) framework to enhance EEG-based visual decoding. Specifically, we introduce the EEG Graph Transformer (EGT), a novel graph-based neural architecture that simultaneously encodes spatial brain connectivity and temporal neural dynamics. To mitigate high intra-subject variability, we propose Graph Archetype Contrastive Learning (GAC), which learns subject-specific EEG graph archetypes to improve feature consistency and class separability. Furthermore, we conduct comprehensive subject-dependent and subject-independent evaluations on the Things-EEG dataset, demonstrating that our approach significantly outperforms prior state-of-the-art EEG decoding methods.The results underscore the transformative potential of integrating graph-based learning with contrastive objectives to enhance EEG-based brain decoding, paving the way for more generalizable and robust neural representations.
♻ ☆ Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs NeurIPS 2025
Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation, which require additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka Pilot (M-Pilot), a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with M-Pilot serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. M-Pilot is trained to pivot the outputs of the black-box LLM aligning with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on diverse tasks demonstrate that our method effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks.
comment: Accepted by NeurIPS 2025
♻ ☆ ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification ICCV 2025
Bone health studies are crucial in medical practice for the early detection and treatment of Osteopenia and Osteoporosis. Clinicians usually make a diagnosis based on densitometry (DEXA scans) and patient history. The applications of AI in this field are ongoing research. Most successful methods rely on deep learning models that use vision alone (DEXA/X-ray imagery) and focus on prediction accuracy, while explainability is often disregarded and left to post hoc assessments of input contributions. We propose ProtoMedX, a multi-modal (multimodal) model that uses both DEXA scans of the lumbar spine and patient records. ProtoMedX's prototype-based architecture is explainable by design, which is crucial for medical applications, especially in the context of the upcoming EU AI Act, as it allows explicit analysis of model decisions, including incorrect ones. ProtoMedX demonstrates state-of-the-art performance in bone health classification while also providing explanations that can be visually understood by clinicians. Using a dataset of 4,160 real NHS patients, the proposed ProtoMedX achieves 87.58% accuracy in vision-only tasks and 89.8% in its multi-modal variant, both surpassing existing published methods.
comment: ICCV 2025 (PHAROS-AFE-AIMI: Adaptation, Fairness, and Explainability in Medical Imaging). 8 pages, 5 figures, 4 tables. Keywords: multi-modal, multimodal, prototype learning, explainable AI, interpretable models, case-based reasoning, medical imaging, DEXA, bone health, osteoporosis, osteopenia, diagnosis, classification, clustering
♻ ☆ LLM Fingerprinting via Semantically Conditioned Watermarks
Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations we introduce LLM fingerprinting via semantically conditioned watermarks, replacing fixed query sets with a broad semantic domain, and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain e.g., French language, the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.
♻ ☆ Recurrent Natural Policy Gradient for POMDPs
Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on sample and iteration complexity to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.
♻ ☆ $μ$-Parametrization for Mixture of Experts
Recent years have seen a growing interest and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, the $\mu$Transfer is becoming a key technique. It allows for seamless transfer of optimal hyperparameters across model scales, resulting in a huge reduction in tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $\mu$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
♻ ☆ Latency-Aware Contextual Bandit: Application to Cryo-EM Data Collection
We introduce a latency-aware contextual bandit framework that generalizes the standard contextual bandit problem, where the learner adaptively selects arms and switches decision sets under action delays. In this setting, the learner observes the context and may select multiple arms from a decision set, with the total time determined by the selected subset. The problem can be framed as a special case of semi-Markov decision processes (SMDPs), where contexts and latencies are drawn from an unknown distribution. Leveraging the Bellman optimality equation, we design the contextual online arm filtering (COAF) algorithm, which balances exploration, exploitation, and action latency to minimize regret relative to the optimal average-reward policy. We analyze the algorithm and show that its regret upper bounds match established results in the contextual bandit literature. In numerical experiments on a movie recommendation dataset and cryogenic electron microscopy (cryo-EM) data, we demonstrate that our approach efficiently maximizes cumulative reward over time.
♻ ☆ Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning
Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune (e.g., instruction-tuning, distillation, DPO) the seemingly benign model on their own datasets, they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.
♻ ☆ Adaptive Collaborative Correlation Learning-based Semi-Supervised Multi-Label Feature Selection
Semi-supervised multi-label feature selection has recently been developed to solve the curse of dimensionality problem in high-dimensional multi-label data with certain samples missing labels. Although many efforts have been made, most existing methods use a predefined graph approach to capture the sample similarity or the label correlation. In this manner, the presence of noise and outliers within the original feature space can undermine the reliability of the resulting sample similarity graph. It also fails to precisely depict the label correlation due to the existence of unknown labels. Besides, these methods only consider the discriminative power of selected features, while neglecting their redundancy. In this paper, we propose an Adaptive Collaborative Correlation lEarning-based Semi-Supervised Multi-label Feature Selection (Access-MFS) method to address these issues. Specifically, a generalized regression model equipped with an extended uncorrelated constraint is introduced to select discriminative yet irrelevant features and maintain consistency between predicted and ground-truth labels in labeled data, simultaneously. Then, the instance correlation and label correlation are integrated into the proposed regression model to adaptively learn both the sample similarity graph and the label similarity graph, which mutually enhance feature selection performance. Extensive experimental results demonstrate the superiority of the proposed Access-MFS over other state-of-the-art methods.
♻ ☆ BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
comment: Under Review
♻ ☆ TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization
Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.
comment: fixed typos in the proof
♻ ☆ Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?
Style-conditioned data poisoning is identified as a covert vector for amplifying sociolinguistic bias in large language models. Using small poisoned budgets that pair dialectal prompts -- principally African American Vernacular English (AAVE) and a Southern dialect -- with toxic or stereotyped completions during instruction tuning, this work probes whether linguistic style can act as a latent trigger for harmful behavior. Across multiple model families and scales, poisoned exposure elevates toxicity and stereotype expression for dialectal inputs -- most consistently for AAVE -- while Standard American English remains comparatively lower yet not immune. A multi-metric audit combining classifier-based toxicity with an LLM-as-a-judge reveals stereotype-laden content even when lexical toxicity appears muted, indicating that conventional detectors under-estimate sociolinguistic harms. Additionally, poisoned models exhibit emergent jailbreaking despite the absence of explicit slurs in the poison, suggesting weakened alignment rather than memorization. These findings underscore the need for dialect-aware evaluation, content-level stereotype auditing, and training protocols that explicitly decouple style from toxicity to prevent bias amplification through seemingly minor, style-based contamination.
♻ ☆ Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation
While Bayesian inference provides a principled framework for reasoning under uncertainty, its widespread adoption is limited by the intractability of exact posterior computation, necessitating the use of approximate inference. However, existing methods are often computationally expensive, or demand costly retraining when priors change, limiting their utility, particularly in sequential inference problems such as real-time sensor fusion. To address these challenges, we introduce the Distribution Transformer -- a novel architecture that can learn arbitrary distribution-to-distribution mappings. Our method can be trained to map a prior to the corresponding posterior, conditioned on some dataset -- thus performing approximate Bayesian inference. Our novel architecture represents a prior distribution as a (universally-approximating) Gaussian Mixture Model (GMM), and transforms it into a GMM representation of the posterior. The components of the GMM attend to each other via self-attention, and to the datapoints via cross-attention. We demonstrate that Distribution Transformers both maintain flexibility to vary the prior, and significantly reduces computation times-from minutes to milliseconds-while achieving log-likelihood performance on par with or superior to existing approximate inference methods across tasks such as sequential inference, quantum system parameter inference, and Gaussian Process predictive posterior inference with hyperpriors.
♻ ☆ Disparate Conditional Prediction in Multiclass Classifiers ICML 2025
We propose methods for auditing multiclass classifiers for fairness under multiclass equalized odds,by estimating the deviation from equalized odds when the classifier is not completely fair. We generalize to multiclass classifiers the measure of Disparate Conditional Prediction (DCP), originally suggested by Sabato & Yom-Tov (2020) for binary classifiers. DCP is defined as the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline. We provide new local-optimization methods for estimating the multiclass DCPunder two different regimes,one in which the conditional confusion matrices for each protected sub-population are known, and one in which these cannot be estimated, for instance, because the classifier is inaccessible or because good-quality individual-level data is not available. These methods can be used to detect classifiers that likely treat a significant fraction of the population unfairly. Experiments demonstrate the accuracy of the methods. Code is provided at https://github.com/sivansabato/DCPmulticlass.
comment: Published at ICML 2025
♻ ☆ PFAttack: Stealthy Attack Bypassing Group Fairness in Federated Learning
Federated learning (FL), integrating group fairness mechanisms, allows multiple clients to collaboratively train a global model that makes unbiased decisions for different populations grouped by sensitive attributes (e.g., gender and race). Due to its distributed nature, previous studies have demonstrated that FL systems are vulnerable to model poisoning attacks. However, these studies primarily focus on perturbing accuracy, leaving a critical question unexplored: Can an attacker bypass the group fairness mechanisms in FL and manipulate the global model to be biased? The motivations for such an attack vary; an attacker might seek higher accuracy, yet fairness considerations typically limit the accuracy of the global model or aim to cause ethical disruption. To address this question, we design a novel form of attack in FL, termed Profit-driven Fairness Attack (PFAttack), which aims not to degrade global model accuracy but to bypass fairness mechanisms. Our fundamental insight is that group fairness seeks to weaken the dependence of outputs on input attributes related to sensitive information. In the proposed PFAttack, an attacker can recover this dependence through local fine-tuning across various sensitive groups, thereby creating a biased yet accuracy-preserving malicious model and injecting it into FL through model replacement. Compared to attacks targeting accuracy, PFAttack is more stealthy. The malicious model in PFAttack exhibits subtle parameter variations relative to the original global model, making it robust against detection and filtering by Byzantine-resilient aggregations. Extensive experiments on benchmark datasets are conducted for four fair FL frameworks and three Byzantine-resilient aggregations against model poisoning, demonstrating the effectiveness and stealth of PFAttack in bypassing group fairness mechanisms in FL.
♻ ☆ VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones
Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisonTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in GIFT-Eval benchmark which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
comment: 19 pages
♻ ☆ Generalized Orders of Magnitude for Scalable, Parallel, High-Dynamic-Range Computation
Many domains, from deep learning to finance, require compounding real numbers over long sequences, often leading to catastrophic numerical underflow or overflow. We introduce generalized orders of magnitude (GOOMs), a principled extension of traditional orders of magnitude that incorporates floating-point numbers as a special case, and which in practice enables stable computation over significantly larger dynamic ranges of real numbers than previously possible. We implement GOOMs, along with an efficient custom parallel prefix scan, to support native execution on parallel hardware such as GPUs. We demonstrate that our implementation of GOOMs outperforms traditional approaches with three representative experiments, all of which were previously considered impractical or impossible, and now become possible and practical: (1) compounding real matrix products far beyond standard floating-point limits; (2) estimating spectra of Lyapunov exponents in parallel, orders of magnitude faster than with previous methods, applying a novel selective-resetting method to prevent state colinearity; and (3) capturing long-range dependencies in deep recurrent neural networks with non-diagonal recurrent states, computed in parallel via a prefix scan, without requiring any form of stabilization. Our results show that our implementation of GOOMs, combined with efficient parallel scanning, offers a scalable and numerically robust alternative to conventional floating-point numbers for high-dynamic-range applications.
comment: 18 pages, 4 figures (main text). 14 pages, 21 figures (appendix). Code is at https://github.com/glassroom/generalized_orders_of_magnitude
♻ ☆ Rethinking Losses for Diffusion Bridge Samplers NeurIPS 2025
Diffusion bridges are a promising class of deep-learning methods for sampling from unnormalized distributions. Recent works show that the Log Variance (LV) loss consistently outperforms the reverse Kullback-Leibler (rKL) loss when using the reparametrization trick to compute rKL-gradients. While the on-policy LV loss yields identical gradients to the rKL loss when combined with the log-derivative trick for diffusion samplers with non-learnable forward processes, this equivalence does not hold for diffusion bridges or when diffusion coefficients are learned. Based on this insight we argue that for diffusion bridges the LV loss does not represent an optimization objective that can be motivated like the rKL loss via the data processing inequality. Our analysis shows that employing the rKL loss with the log-derivative trick (rKL-LD) does not only avoid these conceptual problems but also consistently outperforms the LV loss. Experimental results with different types of diffusion bridges on challenging benchmarks show that samplers trained with the rKL-LD loss achieve better performance. From a practical perspective we find that rKL-LD requires significantly less hyperparameter optimization and yields more stable training behavior.
comment: Accepted at NeurIPS 2025 as a Conference Paper
♻ ☆ TASP: Topology-aware Sequence Parallelism
Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enable each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfer, which can only utilize a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths which can concurrently transfer data without interference. Based on this, we further observe that the Ring AllGather primitive can also be decomposed into the same number of concurrent ring-styled data transfer at every iteration. Based on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to 3.58 speedup than Ring Attention and its variant Zigzag-Ring Attention. The code is available at https://github.com/infinigence/HamiltonAttention.
♻ ☆ OneForecast: A Universal Framework for Global and Regional Weather Forecasting
Accurate weather forecasts are important for disaster prevention, agricultural planning, etc. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. Codes link https://github.com/YuanGao-YG/OneForecast.
♻ ☆ Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs
Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a phrase and focus on the attack's effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
♻ ☆ OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (CoT) tasks. However, existing frameworks commonly face challenges such as inference bottlenecks and complexity barriers, which restrict their accessibility to newcomers. To bridge this gap, we introduce \textbf{OpenRLHF}, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency, with speedups ranging from 1.22x to 1.68x across different model sizes, compared to state-of-the-art frameworks. Additionally, it requires significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.
comment: update template
♻ ☆ Multi-Continental Healthcare Modelling Using Blockchain-Enabled Federated Learning
One of the biggest challenges of building artificial intelligence (AI) model in the healthcare area is the data sharing. Since healthcare data is private, sensitive, and heterogeneous, collecting sufficient data for modelling is exhausting, costly, and sometimes impossible. In this paper, we propose a framework for global healthcare modelling using datasets from multi-continents (Europe, North America, and Asia) without sharing the local datasets, and choose glucose management as a study model to verify its effectiveness. Technically, blockchain-enabled federated learning is implemented with adaptation to meet the privacy and safety requirements of healthcare data, meanwhile, it rewards honest participation and penalizes malicious activities using its on-chain incentive mechanism. Experimental results show that the proposed framework is effective, efficient, and privacy-preserving. Its prediction accuracy consistently outperforms models trained on limited personal data and achieves comparable or even slightly better results than centralized training in certain scenarios, all while preserving data privacy. This work paves the way for international collaborations on healthcare projects, where additional data is crucial for reducing bias and providing benefits to humanity.
comment: Accepted by IEEE Global Blockchain Conference, 2025
♻ ☆ Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. Standard retrieval process prioritized relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
comment: Accepted by SIGIR-AP25
♻ ☆ TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation due to sequence lengths out-of-distribution, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV Cache criticality at token-level. By per-head soft voting mechanism, TokenSelect selectively involves a few critical KV cache tokens in attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive Query similarity and implemented the efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of TokenSelect demonstrates up to $23.84\times$ speedup in attention computation and up to $2.28\times$ acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
comment: Accepted by EMNLP2025
♻ ☆ Panorama: Fast-Track Nearest Neighbors
Approximate Nearest-Neighbor Search (ANNS) efficiently finds data items whose embeddings are close to that of a given query in a high-dimensional space, aiming to balance accuracy with speed. Used in recommendation systems, image and video retrieval, natural language processing, and retrieval-augmented generation (RAG), ANNS algorithms such as IVFPQ, HNSW graphs, Annoy, and MRPT utilize graph, tree, clustering, and quantization techniques to navigate large vector spaces. Despite this progress, ANNS systems spend up to 99\% of query time to compute distances in their final refinement phase. In this paper, we present PANORAMA, a machine learning-driven approach that tackles the ANNS verification bottleneck through data-adaptive learned orthogonal transforms that facilitate the accretive refinement of distance bounds. Such transforms compact over 90\% of signal energy into the first half of dimensions, enabling early candidate pruning with partial distance computations. We integrate PANORAMA into state-of-the-art ANNS methods, namely IVFPQ/Flat, HNSW, MRPT, and Annoy, without index modification, using level-major memory layouts, SIMD-vectorized partial distance computations, and cache-aware access patterns. Experiments across diverse datasets -- from image-based CIFAR-10 and GIST to modern embedding spaces including OpenAI's Ada 2 and Large 3 -- demonstrate that PANORAMA affords a 2--30$\times$ end-to-end speedup with no recall loss.
♻ ☆ Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning
This paper provides the first expert sample complexity characterization for learning a Nash equilibrium from expert data in Markov Games. We show that a new quantity named the single policy deviation concentrability coefficient is unavoidable in the non-interactive imitation learning setting, and we provide an upper bound for behavioral cloning (BC) featuring such coefficient. BC exhibits substantial regret in games with high concentrability coefficient, leading us to utilize expert queries to develop and introduce two novel solution algorithms: MAIL-BRO and MURMAIL. The former employs a best response oracle and learns an $\varepsilon$-Nash equilibrium with $\mathcal{O}(\varepsilon^{-4})$ expert and oracle queries. The latter bypasses completely the best response oracle at the cost of a worse expert query complexity of order $\mathcal{O}(\varepsilon^{-8})$. Finally, we provide numerical evidence, confirming our theoretical findings.
♻ ☆ Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning ICLR 2026
Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks.
comment: 9 pages main, 25 pages appendix with reference. Submitted to ICLR 2026
♻ ☆ RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation ICCV 2025
Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and clothing semantic absence in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusing. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
comment: Accept by ICCV 2025 (Highlight). Project website: https://colorful-liyu.github.io/RAGDiffusion-page/
♻ ☆ FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification NeurIPS 2025
Medical image classification requires not only high predictive performance but also interpretability to ensure clinical trust and adoption. Graph Neural Networks (GNNs) offer a powerful framework for modeling relational structures within datasets; however, standard GNNs often operate as black boxes, limiting transparency and usability, particularly in clinical settings. In this work, we present an interpretable graph-based learning framework named FireGNN that integrates trainable fuzzy rules into GNNs for medical image classification. These rules embed topological descriptors - node degree, clustering coefficient, and label agreement - using learnable thresholds and sharpness parameters to enable intrinsic symbolic reasoning. Additionally, we explore auxiliary self-supervised tasks (e.g., homophily prediction, similarity entropy) as a benchmark to evaluate the contribution of topological learning. Our fuzzy-rule-enhanced model achieves strong performance across five MedMNIST benchmarks and the synthetic dataset MorphoMNIST, while also generating interpretable rule-based explanations. To our knowledge, this is the first integration of trainable fuzzy rules within a GNN. Source Code: https://github.com/basiralab/FireGNN
comment: Accepted at NeurIPS 2025 Conference (Workshop Track), San Diego, USA
♻ ☆ Maintaining Performance with Less Data
We propose a novel method for training a neural network for image classification to reduce input data dynamically, in order to reduce the costs of training a neural network model. As Deep Learning tasks become more popular, their computational complexity increases, leading to more intricate algorithms and models which have longer runtimes and require more input data. The result is a greater cost on time, hardware, and environmental resources. By using data reduction techniques, we reduce the amount of work performed, and therefore the environmental impact of AI techniques, and with dynamic data reduction we show that accuracy may be maintained while reducing runtime by up to 50%, and reducing carbon emission proportionally.
comment: 12 pages, 8 figures, 11 tables
♻ ☆ OBSR: Open Benchmark for Spatial Representations
GeoAI is evolving rapidly, fueled by diverse geospatial datasets like traffic patterns, environmental data, and crowdsourced OpenStreetMap (OSM) information. While sophisticated AI models are being developed, existing benchmarks are often concentrated on single tasks and restricted to a single modality. As such, progress in GeoAI is limited by the lack of a standardized, multi-task, modality-agnostic benchmark for their systematic evaluation. This paper introduces a novel benchmark designed to assess the performance, accuracy, and efficiency of geospatial embedders. Our benchmark is modality-agnostic and comprises 7 distinct datasets from diverse cities across three continents, ensuring generalizability and mitigating demographic biases. It allows for the evaluation of GeoAI embedders on various phenomena that exhibit underlying geographic processes. Furthermore, we establish a simple and intuitive task-oriented model baselines, providing a crucial reference point for comparing more complex solutions.
comment: ACM SIGSPATIAL 2025 Full Paper
♻ ☆ (Token-Level) InfoRMIA: Stronger Membership Inference and Memorization Assessment for LLMs
Machine learning models are known to leak sensitive information, as they inevitably memorize (parts of) their training data. More alarmingly, large language models (LLMs) are now trained on nearly all available data, which amplifies the magnitude of information leakage and raises serious privacy risks. Hence, it is more crucial than ever to quantify privacy risk before the release of LLMs. The standard method to quantify privacy is via membership inference attacks, where the state-of-the-art approach is the Robust Membership Inference Attack (RMIA). In this paper, we present InfoRMIA, a principled information-theoretic formulation of membership inference. Our method consistently outperforms RMIA across benchmarks while also offering improved computational efficiency. In the second part of the paper, we identify the limitations of treating sequence-level membership inference as the gold standard for measuring leakage. We propose a new perspective for studying membership and memorization in LLMs: token-level signals and analyses. We show that a simple token-based InfoRMIA can pinpoint which tokens are memorized within generated outputs, thereby localizing leakage from the sequence level down to individual tokens, while achieving stronger sequence-level inference power on LLMs. This new scope rethinks privacy in LLMs and can lead to more targeted mitigation, such as exact unlearning.
♻ ☆ Foundation Models for Structural Health Monitoring
Structural Health Monitoring (SHM) is a critical task for ensuring the safety and reliability of civil infrastructures, typically realized on bridges and viaducts by means of vibration monitoring. In this paper, we propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for SHM. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training, which, coupled with task-specific fine-tuning, allows them to outperform state-of-the-art traditional methods on diverse tasks, including Anomaly Detection (AD) and Traffic Load Estimation (TLE). We then extensively explore model size versus accuracy trade-offs and experiment with Knowledge Distillation (KD) to improve the performance of smaller Transformers, enabling their embedding directly into the SHM edge nodes. We showcase the effectiveness of our foundation models using data from three operational viaducts. For AD, we achieve a near-perfect 99.9% accuracy with a monitoring time span of just 15 windows. In contrast, a state-of-the-art method based on Principal Component Analysis (PCA) obtains its first good result (95.03% accuracy), only considering 120 windows. On two different TLE tasks, our models obtain state-of-the-art performance on multiple evaluation metrics (R$^2$ score, MAE% and MSE%). On the first benchmark, we achieve an R$^2$ score of 0.97 and 0.90 for light and heavy vehicle traffic, respectively, while the best previous approach (a Random Forest) stops at 0.91 and 0.84. On the second one, we achieve an R$^2$ score of 0.54 versus the 0.51 of the best competitor method, a Long-Short Term Memory network.
comment: 17 pages, 6 tables, 9 figures
♻ ☆ Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation
While artificial neural networks are known as universal approximators for continuous functions, many modern approaches rely on overparameterized architectures with high computational cost. In this work, we introduce the Barycentric Neural Network (BNN): a compact shallow architecture that encodes both structure and parameters through a fixed set of base points and their associated barycentric coordinates. We show that the BNN enables the exact representation of continuous piecewise linear functions (CPLFs), ensuring strict continuity across segments. Given that any continuous function on a compact domain can be uniformly approximated by CPLFs, the BNN emerges as a flexible and interpretable tool for function approximation. To enhance geometric fidelity in low-resource scenarios, such as those with few base points to create BNNs or limited training epochs, we propose length-weighted persistent entropy (LWPE): a stable variant of persistent entropy. Our approach integrates the BNN with a loss function based on LWPE to optimize the base points that define the BNN, rather than its internal parameters. Experimental results show that our approach achieves superior and faster approximation performance compared to standard losses (MSE, RMSE, MAE and LogCosh), offering a computationally sustainable alternative for function approximation.
♻ ☆ Exact Causal Attention with 10% Fewer Operations
We present Exact Causal Attention (ECA), a Strassen-style algorithm that computes exact Causal Attention using 10\% fewer operations. ECA improves a special class of matrix multiplications where either one operand or the output matrix is upper- or lower-triangular. This includes all matrix multiplication operations in the forward and backward pass of Causal Attention, such as masked product $\mathrm{Mask}(QK^{T})$. ECA is built upon algebraic identities discovered via machine learning and combinatorial search. We note that ECA cannot accelerate fused kernels such as FlashAttention on GPU. This is because ECA requires materialization of large intermediate expressions in the memory, while FlashAttention does not. However, it provides an alternative approach for compute-bound applications and can potentially be useful in scenarios with FLOPs considerations.
comment: improved presentation and clarified ambiguous claims
♻ ☆ SmartUT: Receive Beamforming for Spectral Coexistence of NGSO Satellite Systems
In this paper, we investigate downlink co-frequency interference (CFI) mitigation in non-geostationary satellites orbits (NGSOs) co-existing systems. Traditional mitigation techniques, such as Zero-forcing (ZF), produce a null towards the direction of arrivals (DOAs) of the interfering signals, but they suffer from high computational complexity due to matrix inversions and required knowledge of the channel state information (CSI). Furthermore, adaptive beamformers, such as sample matrix inversion (SMI)-based minimum variance, provide poor performance when the available snapshots are limited. We propose a Mamba-based beamformer (MambaBF) that leverages an unsupervised deep learning (DL) approach and can be deployed on the user terminal (UT) antenna array, for assisting downlink beamforming and CFI mitigation using only a limited number of available array snapshots as input, and without CSI knowledge. Simulation results demonstrate that MambaBF consistently outperforms conventional beamforming techniques in mitigating interference and maximizing the signal-to-interference-plus-noise ratio (SINR), particularly under challenging conditions characterized by low SINR, limited snapshots, and imperfect CSI.
♻ ☆ Graphon Mixtures
Social networks have a small number of large hubs, and a large number of small dense communities. We propose a generative model that captures both hub and dense structures. Based on recent results about graphons on line graphs, our model is a graphon mixture, enabling us to generate sequences of graphs where each graph is a combination of sparse and dense graphs. We propose a new condition on sparse graphs (the max-degree), which enables us to identify hubs. We show theoretically that we can estimate the normalized degree of the hubs, as well as estimate the graphon corresponding to sparse components of graph mixtures. We illustrate our approach on synthetic data, citation graphs, and social networks, showing the benefits of explicitly modeling sparse graphs.
♻ ☆ The Poisson Midpoint Method for Langevin Dynamics: Provably Efficient Discretization for Diffusion Models
Langevin Dynamics is a Stochastic Differential Equation (SDE) central to sampling and generative modeling and is implemented via time discretization. Langevin Monte Carlo (LMC), based on the Euler-Maruyama discretization, is the simplest and most studied algorithm. LMC can suffer from slow convergence - requiring a large number of steps of small step-size to obtain good quality samples. This becomes stark in the case of diffusion models where a large number of steps gives the best samples, but the quality degrades rapidly with smaller number of steps. Randomized Midpoint Method has been recently proposed as a better discretization of Langevin dynamics for sampling from strongly log-concave distributions. However, important applications such as diffusion models involve non-log concave densities and contain time varying drift. We propose its variant, the Poisson Midpoint Method, which approximates a small step-size LMC with large step-sizes. We prove that this can obtain a quadratic speed up of LMC under very weak assumptions. We apply our method to diffusion models for image generation and show that it maintains the quality of DDPM with 1000 neural network calls with just 50-80 neural network calls and outperforms ODE based methods with similar compute.
comment: "One often meets his destiny on the road he takes to avoid it" - Master Oogway. My destiny seems to be to write triangle inequalities for the rest of my life
♻ ☆ Empirical evaluation of normalizing flows in Markov Chain Monte Carlo
Recent advances in MCMC use normalizing flows to precondition target distributions and enable jumps to distant regions. However, there is currently no systematic comparison of different normalizing flow architectures for MCMC. As such, many works choose simple flow architectures that are readily available and do not consider other models. Guidelines for choosing an appropriate architecture would reduce analysis time for practitioners and motivate researchers to take the recommended models as foundations to be improved. We provide the first such guideline by extensively evaluating many normalizing flow architectures on various flow-based MCMC methods and target distributions. When the target density gradient is available, we show that flow-based MCMC outperforms classic MCMC for suitable NF architecture choices with minor hyperparameter tuning. When the gradient is unavailable, flow-based MCMC wins with off-the-shelf architectures. We find contractive residual flows to be the best general-purpose models with relatively low sensitivity to hyperparameter choice. We also provide various insights into normalizing flow behavior within MCMC when varying their hyperparameters, properties of target distributions, and the overall computational budget.
comment: 55 pages, 10 figures. Accepted for publication in Machine Learning; post-review, pre-copyediting version
♻ ☆ A pseudo-inverse of a line graph
Line graphs are an alternative representation of graphs where each vertex of the original (root) graph becomes an edge. However not all graphs have a corresponding root graph, hence the transformation from graphs to line graphs is not invertible. We investigate the case when there is a small perturbation in the space of line graphs, and try to recover the corresponding root graph, essentially defining the inverse of the line graph operation. We propose a linear integer program that edits the smallest number of edges in the line graph, that allow a root graph to be found. We use the spectral norm to theoretically prove that such a pseudo-inverse operation is well behaved. Illustrative empirical experiments on Erd\H{o}s-R\'enyi graphs show that our theoretical results work in practice.
♻ ☆ Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models
To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an "imagination-based" environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search-based planning algorithm that leverages non-differentiable domain metrics as reward signals to explore high-return future sequences. These identified high-reward candidates then serve as pseudo-labels to continuously optimize the agent's policy through iterative self-training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.
♻ ☆ SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.
♻ ☆ A Kernel Distribution Closeness Testing
The distribution closeness testing (DCT) assesses whether the distance between a distribution pair is at least $\epsilon$-far. Existing DCT methods mainly measure discrepancies between a distribution pair defined on discrete one-dimensional spaces (e.g., using total variation), which limits their applications to complex data (e.g., images). To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measurement of the distributional discrepancy between two complex distributions, into DCT scenarios. However, we find that MMD's value can be the same for many pairs of distributions that have different norms in the same reproducing kernel Hilbert space (RKHS), making MMD less informative when assessing the closeness levels for multiple distribution pairs. To mitigate the issue, we design a new measurement of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales MMD's value using the RKHS norms of distributions. Based on the asymptotic distribution of NAMMD, we finally propose the NAMMD-based DCT to assess the closeness levels of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power compared to MMD-based DCT, with bounded type-I error, which is also validated by extensive experiments on many types of data (e.g., synthetic noise, real images). Furthermore, we also apply the proposed NAMMD for addressing the two-sample testing problem and find NAMMD-based two-sample test has higher test power than the MMD-based two-sample test in both theory and experiments.
♻ ☆ Markets for Models
Motivated by the prevalence of prediction problems in the economy, we study markets in which firms sell models to a consumer to help improve their prediction. Firms decide whether to enter, choose models to train on their data, and set prices. The consumer can purchase multiple models and use a weighted average of the models bought. Market outcomes can be expressed in terms of the \emph{bias-variance decompositions} of the models that firms sell. We give conditions when symmetric firms will choose different modeling techniques, e.g., each using only a subset of available covariates. We also show firms can choose inefficiently biased models or inefficiently costly models to deter entry by competitors.
♻ ☆ Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we propose FedMosaic, a method that jointly addresses data and model heterogeneity with a task-relevance-aware model aggregation strategy to reduce parameter interference, and a dimension-invariant module that enables knowledge sharing across heterogeneous architectures without huge computational cost. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. The empirical study shows that FedMosaic outperforms the state-of-the-art PFL methods, excelling in both personalization and generalization capabilities under challenging, realistic scenarios.
♻ ☆ Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and running-time choices (\eg parallel or not, batch size, etc), and often fail to account for model size, making it difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose \ours\footnote{https://github.com/zhiyuanpeng/EER-FLOPs.} for LLM-based rerankers: RPP (ranking metrics per PetaFLOP), measuring how much ranking quality (e.g., NDCG or MRR) a method achieves per PetaFLOP, and QPP (queries per PetaFLOP), measuring how many queries can be processed per PetaFLOP. Accompanied by the new metrics, an interpretable FLOPs estimator is developed to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
comment: Accepted by EMNLP Industry Track 2025
♻ ☆ Teaching Your Models to Understand Code via Focal Preference Alignment
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.
comment: Accepted by EMNLP'25
♻ ☆ MoM: Linear Sequence Modeling with Mixture-of-Memories ICLR 2026
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training, while constant-complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
comment: Technical report, 18 pages, submitted to ICLR 2026
♻ ☆ Learning General Causal Structures with Hidden Dynamic Process for Climate Analysis
Understanding climate dynamics requires going beyond correlations in observational data to uncover their underlying causal process. Latent drivers, such as atmospheric processes, play a critical role in temporal dynamics, while direct causal influences also exist among geographically proximate observed variables. Traditional Causal Representation Learning (CRL) typically focuses on latent factors but overlooks such observable-to-observable causal relations, limiting its applicability to climate analysis. In this paper, we introduce a unified framework that jointly uncovers (i) causal relations among observed variables and (ii) latent driving forces together with their interactions. We establish conditions under which both the hidden dynamic processes and the causal structure among observed variables are simultaneously identifiable from time-series data. Remarkably, our guarantees hold even in the nonparametric setting, leveraging contextual information to recover latent variables and causal relations. Building on these insights, we propose CaDRe (Causal Discovery and Representation learning), a time-series generative model with structural constraints that integrates CRL and causal discovery. Experiments on synthetic datasets validate our theoretical results. On real-world climate datasets, CaDRe not only delivers competitive forecasting accuracy but also recovers visualized causal graphs aligned with domain expertise, thereby offering interpretable insights into climate systems.
♻ ☆ Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation or query rejection mechanisms, they suffer from computational efficiency and sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate utilization of high-confidence outputs, whereas uncertain predictions trigger a slow refinement model for accuracy improvement. To align model behavior with our proposed object, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. The framework establishes a scalable paradigm for deploying reliable LLMs in error-sensitive applications, effectively balancing accuracy and practical utility.
♻ ☆ Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks
Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN's effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN's ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
comment: 10 pages, 5 figures, Accepted at IJCAI 2025. Proceedings: https://www.ijcai.org/proceedings/2025/0724.pdf
♻ ☆ LogAction: Consistent Cross-system Anomaly Detection through Logs via Active Domain Adaptation
Log-based anomaly detection is a essential task for ensuring the reliability and performance of software systems. However, the performance of existing anomaly detection methods heavily relies on labeling, while labeling a large volume of logs is highly challenging. To address this issue, many approaches based on transfer learning and active learning have been proposed. Nevertheless, their effectiveness is hindered by issues such as the gap between source and target system data distributions and cold-start problems. In this paper, we propose LogAction, a novel log-based anomaly detection model based on active domain adaptation. LogAction integrates transfer learning and active learning techniques. On one hand, it uses labeled data from a mature system to train a base model, mitigating the cold-start issue in active learning. On the other hand, LogAction utilize free energy-based sampling and uncertainty-based sampling to select logs located at the distribution boundaries for manual labeling, thus addresses the data distribution gap in transfer learning with minimal human labeling efforts. Experimental results on six different combinations of datasets demonstrate that LogAction achieves an average 93.01% F1 score with only 2% of manual labels, outperforming some state-of-the-art methods by 26.28%. Website: https://logaction.github.io
comment: The 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025
♻ ☆ Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning
Continual learning (CL) with large pre-trained models is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters, but suffer from interference, redundancy, and ambiguous routing due to coarse adapter-level selection. However, this design introduces three key challenges: 1) Interference: Activating full LoRA experts per input leads to subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: Newly added experts often duplicate or contradict existing knowledge due to unnecessary activation of unrelated ranks and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features across tasks confuse the router, resulting in unstable expert assignments. As more experts accumulate, earlier task routing degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approaches with self-activated and sparse rank activation for CL. Unlike mixing multiple low-rank matrices, MoRA decomposes each rank-r update into r rank-one components, each treated as an independent expert, enabling fine-grained rank-one expert utilization while mitigating interference and redundancy. To avoid ambiguous routing, we propose that each rank-one expert can infer its own relevance via intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning benchmarks using CLIP and language models, analyzing both in-domain learning and out-of-domain forgetting/generalization during fine-tuning. MoRA shows significant effectiveness in enhancing CL with PTMs, and improving generalization while mitigating forgetting.
comment: Preprint
♻ ☆ Attention based End to end network for Offline Writer Identification on Word level data
Writer identification due to its widespread application in various fields has gained popularity over the years. In scenarios where optimum handwriting samples are available, whether they be in the form of a single line, a sentence, or an entire page, writer identification algorithms have demonstrated noteworthy levels of accuracy. However, in scenarios where only a limited number of handwritten samples are available, particularly in the form of word images, there is a significant scope for improvement. In this paper, we propose a writer identification system based on an attention-driven Convolutional Neural Network (CNN). The system is trained utilizing image segments, known as fragments, extracted from word images, employing a pyramid-based strategy. This methodology enables the system to capture a comprehensive representation of the data, encompassing both fine-grained details and coarse features across various levels of abstraction. These extracted fragments serve as the training data for the convolutional network, enabling it to learn a more robust representation compared to traditional convolution-based networks trained on word images. Additionally, the paper explores the integration of an attention mechanism to enhance the representational power of the learned features. The efficacy of the proposed algorithm is evaluated on three benchmark databases, demonstrating its proficiency in writer identification tasks, particularly in scenarios with limited access to handwriting data.
♻ ☆ Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures
Large language models (LLMs) are increasingly powering web-based applications, whose effectiveness relies on fine-tuning with large-scale instruction data. However, such data often contains valuable or sensitive information that limits its public sharing among business organizations. Federated learning (FL) enables collaborative fine-tuning of LLMs without accessing raw data. Existing approaches to federated LLM fine-tuning usually adopt a uniform model architecture, making it challenging to fit highly heterogeneous client-side data in varying domains and tasks, e.g., hospitals and financial institutions conducting federated fine-tuning may require different LLM architectures due to the distinct nature of their domains and tasks. To address this, we propose FedAMoLE, a lightweight personalized FL framework that enables data-driven heterogeneous model architectures. It features a heterogeneous mixture of low-rank adaptation (LoRA) experts module to aggregate architecturally heterogeneous models and a reverse selection-based expert assignment strategy to tailor model architectures for each client based on data distributions. Experiments across seven scenarios demonstrate that FedAMoLE improves client-side performance by an average of 5.97% over existing approaches while maintaining practical memory, communication, and computation overhead.
comment: On going work. Codes are released at https://github.com/zyc140345/FedAMoLE
♻ ☆ Long Chain-of-Thought Reasoning Across Languages
While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
comment: v1 is a workshop version accepted to SCALR @ COLM 2025
♻ ☆ Knowledge-Driven Federated Graph Learning on Model Heterogeneity
Federated graph learning (FGL) has emerged as a promising paradigm for collaborative graph representation learning, enabling multiple parties to jointly train models while preserving data privacy. However, most existing approaches assume homogeneous client models and largely overlook the challenge of model-centric heterogeneous FGL (MHtFGL), which frequently arises in practice when organizations employ graph neural networks (GNNs) of different scales and architectures.Such architectural diversity not only undermines smooth server-side aggregation, which presupposes a unified representation space shared across clients' updates, but also further complicates the transfer and integration of structural knowledge across clients. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework. FedGKC introduces a lightweight Copilot Model on each client to facilitate knowledge exchange while local architectures are heterogeneous across clients, and employs two complementary mechanisms: Client-side Self-Mutual Knowledge Distillation, which transfers effective knowledge between local and copilot models through bidirectional distillation with multi-view perturbation; and Server-side Knowledge-Aware Model Aggregation, which dynamically assigns aggregation weights based on knowledge provided by clients. Extensive experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy gain of 3.74% over baselines in MHtFGL scenarios, while maintaining excellent performance in homogeneous settings.
♻ ☆ Task Vector Bases: A Unified and Scalable Framework for Compressed Task Arithmetic
Task arithmetic, representing downstream tasks through linear operations on task vectors, has emerged as a simple yet powerful paradigm for transferring knowledge across diverse settings. However, maintaining a large collection of task vectors introduces scalability challenges in both storage and computation. We propose Task Vector Bases, a framework compressing $T$ task vectors into $M < T$ basis vectors while preserving the functionality of task arithmetic. By representing each task vector as a structured linear combination of basis atoms, our approach supports standard operations such as addition, negation, as well as more advanced arithmetic ones. The framework is orthogonal to other efficiency-oriented improvements in task arithmetic and can be used in combination with them. We provide theoretical analysis showing that basis compression retains addition generalization guarantees and enables principled unlearning, with error bounds depending on reconstruction quality. Empirically, our proposed basis construction methods consistently outperform heuristic basis construction baselines and, in some cases, even surpass the performance of full task vector collections across diverse downstream applications while reducing storage and computational requirements. The code is available at https://github.com/uiuctml/TaskVectorBasis.
♻ ☆ AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema
Prompt injection attacks pose a significant challenge to the safe deployment of Large Language Models (LLMs) in real-world applications. While prompt-based detection offers a lightweight and interpretable defense strategy, its effectiveness has been hindered by the need for manual prompt engineering. To address this issue, we propose AEGIS , an Automated co-Evolutionary framework for Guarding prompt Injections Schema. Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique. This framework enables both attackers and defenders to autonomously evolve via a Textual Gradient Optimization (TGO) module, leveraging feedback from an LLM-guided evaluation loop. We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines, achieving superior robustness in both attack success and detection. Specifically, the attack success rate (ASR) reaches 1.0, representing an improvement of 0.26 over the baseline. For detection, the true positive rate (TPR) improves by 0.23 compared to the previous best work, reaching 0.84, and the true negative rate (TNR) remains comparable at 0.89. Ablation studies confirm the importance of co-evolution, gradient buffering, and multi-objective optimization. We also confirm that this framework is effective in different LLMs. Our results highlight the promise of adversarial training as a scalable and effective approach for guarding prompt injections.
♻ ☆ Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy update, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation gets strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address the potential policy drift. Reugularizations such as the clipping of tokens with high covariance between probability and advantage are introduced to the trajectory-level entropy control to curb over-confidence.
comment: 26 pages, 11 figures
♻ ☆ Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models
In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation, ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, relation-reversed, and one-hop reasoned data. We then conduct a rigorous evaluation of 15 state-of-the-art methods across three datasets, revealing that unlearned models still recall paraphrased answers and retain target facts in their intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while maintaining a 40.73% boost in forgetting implicit knowledge. Our code can be found in https://github.com/MaybeLizzy/PERMU.
♻ ☆ Solving Time-Fractional Partial Integro-Differential Equations Using Tensor Neural Network
In this paper, we propose a novel machine learning method based on adaptive tensor neural network subspace to solve linear time-fractional diffusion-wave equations and nonlinear time-fractional partial integro-differential equations. In this framework, the tensor neural network and Gauss-Jacobi quadrature are effectively combined to construct a universal numerical scheme for the temporal Caputo derivative with orders spanning $ (0,1)$ and $(1,2)$. Specifically, in order to effectively utilize Gauss-Jacobi quadrature to discretize Caputo derivatives, we design the tensor neural network function multiplied by the function $t^{\mu}$ where the power $\mu$ is selected according to the parameters of the equations at hand. Finally, some numerical examples are provided to validate the efficiency and accuracy of the proposed tensor neural network based machine learning method.
comment: 26 pages, 13 figures, 9 tables. Accepted for publication in the SIAM Journal on Scientific Computing
♻ ☆ Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency becomes an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, which utilizes a novel signaling mechanism: when part of the output finishes, the computation kernel sends a signal to trigger the communication of that part, while continuing the computation of the remaining part (interference-free computation). Consequently, the communication of the finished part and the computation of the remaining part can be overlapped. On top of the signaling mechanism, FlashOverlap comprises two key components: (1) the determination of the signaling timing to boost the overlap efficiency (tile-wise overlapping), and (2) a pre-communication reordering to create the contiguous address for finished data, enabling communication by simply calling NCCL APIs (communication agnosticism), and a post-communication reordering to correct the data order. Experiments show that FlashOverlap achieves up to 1.65x speedup through overlap, outperforming existing works in most cases. Code is available at https://github.com/infinigence/FlashOverlap.
comment: 18 pages, 16 figures, 5 tables, to be published in EuroSys'26
♻ ☆ Temporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting
We propose Temporal Conformal Prediction (TCP), a distribution-free framework for constructing well-calibrated prediction intervals in nonstationary time series. TCP couples a modern quantile forecaster with a split-conformal calibration layer on a rolling window and, in its TCP-RM variant, augments the conformal threshold with a single online Robbins-Monro (RM) offset to steer coverage toward a target level in real time. We benchmark TCP against GARCH, Historical Simulation, and a rolling Quantile Regression (QR) baseline across equities (S&P 500), cryptocurrency (Bitcoin), and commodities (Gold). Three results are consistent across assets. First, rolling QR yields the sharpest intervals but is materially under-calibrated (e.g., S&P 500: 83.2% vs. 95% target). Second, TCP (and TCP-RM) achieves near-nominal coverage across assets, with intervals that are wider than Historical Simulation in this evaluation (e.g., S&P 500: 5.21 vs. 5.06). Third, the RM update changes calibration and width only marginally at our default hyperparameters. Crisis-window visualizations around March 2020 show TCP/TCP-RM expanding and then contracting their interval bands promptly as volatility spikes and recedes, with red dots marking days where realized returns fall outside the reported 95% interval (miscoverage). A sensitivity study confirms robustness to window size and step-size choices. Overall, TCP provides a practical, theoretically grounded solution to calibrated uncertainty quantification under distribution shift, bridging statistical inference and machine learning for risk forecasting.
♻ ☆ Golden Ratio Weighting Prevents Model Collapse
Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies have become central challenges in generative model research. In this paper, we investigate this phenomenon within a novel framework, where generative models are iteratively trained on a combination of newly collected real data and synthetic data from the previous training step. To develop an optimal training strategy for integrating real and synthetic data, we evaluate the performance of a weighted training scheme in various scenarios, including Gaussian distribution estimation, generalized linear models, and nonparametric estimation. We theoretically characterize the impact of the mixing proportion and weighting scheme of synthetic data on the final model's performance. Our key finding is that, across different settings, the optimal weighting scheme under different proportions of synthetic data asymptotically follows a unified expression, revealing a fundamental trade-off between leveraging synthetic data and model performance. In some cases, the optimal weight assigned to real data corresponds to the reciprocal of the golden ratio. Finally, we validate our theoretical results on extensive simulated datasets and a real tabular dataset.
♻ ☆ Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling
3D structure modeling is essential across scales, enabling applications from fluid simulation and 3D reconstruction to protein folding and molecular docking. Yet, despite shared 3D spatial patterns, current approaches remain fragmented, with models narrowly specialized for specific domains and unable to generalize across tasks or scales. We propose Uni-3DAR, a unified autoregressive framework for cross-scale 3D generation and understanding. At its core is a coarse-to-fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. We further propose a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next-token prediction strategy that ensures accurate positional modeling, significantly boosting model performance. Extensive experiments across multiple 3D generation and understanding tasks, including small molecules, proteins, polymers, crystals, and macroscopic 3D objects, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256\% relative improvement while delivering inference speeds up to 21.8x faster.
♻ ☆ SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation
We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.
comment: Technical report. 36 pages, including 11 pages of appendix
♻ ☆ More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and EDU-PRM achieves comparable results with SOTA models while only using 1.5% training data. Furthermore, by leveraging our proposed EDU sampling strategy, we observe accuracy boosts from 64.7% to 67.3% for generative reasoning tasks, accompanied by a reduction of 32% in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
♻ ☆ The Curious Case of In-Training Compression of State Space Models
State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for the measure of energy for each state, as well as the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs \emph{during training}, where only dimensions of high influence are identified and preserved. Our approach, \textsc{CompreSSM}, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance. Project code is available at github.com/camail-official/compressm.
♻ ☆ PO-Flow: Flow-based Generative Models for Sampling Potential Outcomes and Counterfactuals
Predicting potential and counterfactual outcomes from observational data is central to clinical decision-making, where physicians must weigh treatments for an individual patient rather than relying solely on average effects at the population level. We propose PO-Flow, a continuous normalizing flow (CNF) framework for causal inference that jointly models potential outcomes and counterfactuals. Trained via flow matching, PO-Flow provides a unified approach to average treatment effect estimation, individualized potential outcome prediction, and counterfactual prediction. Besides, PO-Flow directly learns the densities of potential outcomes, enabling likelihood-based evaluation of predictions. Furthermore, PO-Flow explores counterfactual outcome generation conditioned on the observed factual in general observational datasets, with a supporting recovery result under certain assumptions. PO-Flow outperforms modern baselines across diverse datasets and causal tasks in the potential outcomes framework.
♻ ☆ LLMs on a Budget? Say HOLA
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands posing a barrier for real-time applications in sectors like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: 17.6% EMA on GSM8K, 10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano--proving both scalable and production-ready.
comment: Accepted at EMNLP 2025 (Industry Track)
♻ ☆ Mitigating Noise Detriment in Differentially Private Federated Learning with Model Pre-training
Differentially Private Federated Learning (DPFL) strengthens privacy protection by perturbing model gradients with noise, though at the cost of reduced accuracy. Although prior empirical studies indicate that initializing from pre-trained rather than random parameters can alleviate noise disturbance, the problem of optimally fine-tuning pre-trained models in DPFL remains unaddressed. In this paper, we propose Pretrain-DPFL, a framework that systematically evaluates three most representative fine-tuning strategies: full-tuning (FT), head-tuning (HT), and unified-tuning(UT) combining HT followed by FT. Through convergence analysis under smooth non-convex loss, we establish theoretical conditions for identifying the optimal fine-tuning strategy in Pretrain-DPFL, thereby maximizing the benefits of pre-trained models in mitigating noise disturbance. Extensive experiments across multiple datasets demonstrate Pretrain-DPFL's superiority, achieving $25.22\%$ higher accuracy than scratch training and outperforming the second-best baseline by $8.19\%$, significantly improving the privacy-utility trade-off in DPFL.
♻ ☆ Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations
Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP `forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.
♻ ☆ A Survey of Foundation Models for IoT: Taxonomy and Criteria-Based Analysis
Foundation models have gained growing interest in the IoT domain due to their reduced reliance on labeled data and strong generalizability across tasks, which address key limitations of traditional machine learning approaches. However, most existing foundation model based methods are developed for specific IoT tasks, making it difficult to compare approaches across IoT domains and limiting guidance for applying them to new tasks. This survey aims to bridge this gap by providing a comprehensive overview of current methodologies and organizing them around four shared performance objectives by different domains: efficiency, context-awareness, safety, and security & privacy. For each objective, we review representative works, summarize commonly-used techniques and evaluation metrics. This objective-centric organization enables meaningful cross-domain comparisons and offers practical insights for selecting and designing foundation model based solutions for new IoT tasks. We conclude with key directions for future research to guide both practitioners and researchers in advancing the use of foundation models in IoT applications.
comment: Accepted by CCF Transactions on Pervasive Computing and Interaction (CCF TPCI)
♻ ☆ FairSHAP: Preprocessing for Fairness Through Attribution-Based Data Augmentation
Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing approaches generally lack transparent mechanisms for identifying which features or instances are responsible for unfairness. This obscures the rationale behind data modifications. We introduce FairSHAP, a novel pre-processing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical instances in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. This process reduces discriminative risk - an individual fairness metric - while preserving data integrity and model accuracy. We demonstrate that FairSHAP significantly improves demographic parity and equality of opportunity across diverse tabular datasets, achieving fairness gains with minimal data perturbation and, in some cases, improved predictive performance. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of bias.Our code is on https://github.com/youlei202/FairSHAP.
comment: 4 figures, 19 pages
♻ ☆ HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design
With Large Language Models (LLMs) recently demonstrating impressive proficiency in code generation, it is promising to extend their abilities to Hardware Description Language (HDL). However, LLMs tend to generate single HDL code blocks rather than hierarchical structures for hardware designs, leading to hallucinations, particularly in complex designs like Domain-Specific Accelerators (DSAs). To address this, we propose HiVeGen, a hierarchical LLM-based Verilog generation framework that decomposes generation tasks into LLM-manageable hierarchical submodules. HiVeGen further harnesses the advantages of such hierarchical structures by integrating automatic Design Space Exploration (DSE) into hierarchy-aware prompt generation, introducing weight-based retrieval to enhance code reuse, and enabling real-time human-computer interaction to lower error-correction cost, significantly improving the quality of generated designs.
♻ ☆ Efficient Multi Subject Visual Reconstruction from fMRI Using Aligned Representations
This work introduces a novel approach to fMRI-based visual image reconstruction using a subject-agnostic common representation space. We show that the brain signals of the subjects can be aligned in this common space during training to form a semantically aligned common brain. This is leveraged to demonstrate that aligning subject-specific lightweight modules to a reference subject is significantly more efficient than traditional end-to-end training methods. Our approach excels in low-data scenarios. We evaluate our methods on different datasets, demonstrating that the common space is subject and dataset-agnostic.
♻ ☆ High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training
The development of machine learning for cardiac care is severely hampered by privacy restrictions on sharing real patient electrocardiogram (ECG) data. Although generative AI offers a promising solution, the real-world use of existing model-synthesized ECGs is limited by persistent gaps in trustworthiness and clinical utility. In this work, we address two major shortcomings of current generative ECG methods: insufficient morphological fidelity and the inability to generate personalized, patient-specific physiological signals. To address these gaps, we build on a conditional diffusion-based Structured State Space Model (SSSD-ECG) with two principled innovations: (1) MIDT-ECG (Mel-Spectrogram Informed Diffusion Training), a novel training paradigm with time-frequency domain supervision to enforce physiological structural realism, and (2) multi-modal demographic conditioning to enable patient-specific synthesis. We comprehensively evaluate our approach on the PTB-XL dataset, assessing the synthesized ECG signals on fidelity, clinical coherence, privacy preservation, and downstream task utility. MIDT-ECG achieves substantial gains: it improves morphological coherence, preserves strong privacy guarantees with all metrics evaluated exceeding the baseline by 4-8%, and notably reduces the interlead correlation error by an average of 74%, while demographic conditioning enhances signal-to-noise ratio and personalization. In critical low-data regimes, a classifier trained on datasets supplemented with our synthetic ECGs achieves performance comparable to a classifier trained solely on real data. Together, we demonstrate that ECG synthesizers, trained with the proposed time-frequency structural regularization scheme, can serve as personalized, high-fidelity, privacy-preserving surrogates when real data are scarce, advancing the responsible use of generative AI in healthcare.
♻ ☆ Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
A class of generative models that unifies flow-based and diffusion-based methods is introduced. These models extend the framework proposed in Albergo and Vanden-Eijnden (2023), enabling the use of a broad class of continuous-time stochastic processes called stochastic interpolants to bridge any two probability density functions exactly in finite time. These interpolants are built by combining data from the two prescribed densities with an additional latent variable that shapes the bridge in a flexible way. The time-dependent density function of the interpolant is shown to satisfy a transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion coefficient. Upon consideration of the time evolution of an individual sample, this viewpoint leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with an adjustable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score. We show that minimization of these quadratic objectives leads to control of the likelihood for generative models built upon stochastic dynamics, while likelihood control for deterministic dynamics is more stringent. We also construct estimators for the likelihood and the cross entropy of interpolant-based generative models, and we discuss connections with other methods such as score-based diffusion models, stochastic localization, probabilistic denoising, and rectifying flows. In addition, we demonstrate that stochastic interpolants recover the Schr\"odinger bridge between the two target densities when explicitly optimizing over the interpolant. Finally, algorithmic aspects are discussed and the approach is illustrated on numerical examples.
comment: JMLR Version
♻ ☆ Intention-Conditioned Flow Occupancy Models
Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Comparing with alternative methods for pre-training, our experiments on $36$ state-based and $4$ image-based benchmark tasks demonstrate that the proposed method achieves $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: https://chongyi-zheng.github.io/infom Code: https://github.com/chongyi-zheng/infom
♻ ☆ SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection
Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
♻ ☆ Advancing Automated Urban Planning: Exploring Algorithmic Approaches with Generative Artificial Intelligence
The two fields of urban planning and artificial intelligence (AI) arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we introduce the importance of urban planning from the sustainability, living, economic, disaster, and environmental perspectives. We review the fundamental concepts of urban planning and relate these concepts to crucial open problems of machine learning, including adversarial learning, generative neural networks, deep encoder-decoder networks, conversational AI, and geospatial and temporal machine learning, thereby assaying how AI can contribute to modern urban planning. Thus, a central problem is automated land-use configuration, which is formulated as the generation of land uses and building configuration for a target area from surrounding geospatial, human mobility, social media, environment, and economic activities. Finally, we delineate some implications of AI for urban planning and propose key research areas at the intersection of both topics.
♻ ☆ High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization WACV 2026
Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
comment: WACV 2026: Winter Conference on Applications of Computer Vision 2026
♻ ☆ Making Bias Amplification in Balanced Datasets Directional and Interpretable
Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute A (e.g., gender) and a task T (e.g., cooking). However, these metrics fail to measure biases when A is balanced with T. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.
♻ ☆ Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpassing fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.
♻ ☆ Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media
The ability to predict how shock waves traverse porous and architected materials is a key challenge in planetary defense and in the pursuit of inertial fusion energy. Yet capturing pore collapse, anomalous Hugoniot responses, and localized heating - phenomena that strongly influence asteroid deflection or fusion ignition - has remained a major challenge despite recent advances in single-field and reduced representations. We introduce a multi-field spatio-temporal model (MSTM) that unifies seven coupled fields - pressure, density, temperature, energy, material distribution, and two velocity components - into a single autoregressive surrogate. Trained on high-fidelity hydrocode data, MSTM captures nonlinear shock-driven dynamics across porous and architected configurations, achieving mean errors of 1.4% and 3.2% respectively, all while delivering over three orders of magnitude in speedup. MSTM reduces mean-squared error and structural dissimilarity by 94% relative torelative to single-field spatio-temporal models. This advance transforms problems once considered intractable into tractable design studies, establishing a practical framework for optimizing meso-structured materials in planetary impact mitigation and inertial fusion energy.
comment: 19 pages, 12 figures
♻ ☆ Training-free AI for Earth Observation Change Detection using Physics Aware Neuromorphic Networks
Earth observations from low Earth orbit satellites provide vital information for decision makers to better manage time-sensitive events such as natural disasters. For the data to be most effective for first responders, low latency is required between data capture and its arrival to decision makers. A major bottleneck is in the bandwidth-limited downlinking of the data from satellites to ground stations. One approach to overcome this challenge is to process at least some of the data on-board and prioritise pertinent data to be downlinked. In this work we propose a Physics Aware Neuromorphic Network (PANN) to detect changes caused by natural disasters from a sequence of multi-spectral satellite images and produce a change map, enabling relevant data to be prioritised for downlinking. The PANN used in this study is motivated by physical neural networks comprised of nano-electronic circuit elements known as "memristors" (nonlinear resistors with memory). The weights in the network are dynamic and update in response to varying input signals according to memristor equations of state and electrical circuit conservation laws. The PANN thus generates physics-constrained dynamical output features which are used to detect changes in a natural disaster detection task by applying a distance-based metric. Importantly, this makes the whole model training-free, allowing it to be implemented with minimal computing resources. The PANN was benchmarked against a state-of-the-art AI model and achieved comparable or better results in each natural disaster category. It thus presents a promising solution to the challenge of resource-constrained on-board processing.
comment: 16 pages, 8 figures, 4 tables
♻ ☆ MergeBench: A Benchmark for Merging Domain-Specialized LLMs NeurIPS 2025
Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at \href{https://yifei-he.github.io/mergebench/}{https://yifei-he.github.io/mergebench/}.
comment: NeurIPS 2025 Track on Datasets and Benchmarks
♻ ☆ Reasoning Large Language Model Errors Arise from Hallucinating Critical Problem Features
Large language models have recently made great strides in reasoning task performance through chain-of-thought (CoT) strategies trained via reinforcement learning; however, these "reasoning large language models" (RLLMs) remain imperfect reasoners, and understanding the frequencies and causes of their failure modes is important for both users and developers. We test o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and Grok 3 Mini Beta on graph coloring as a variable-complexity constraint-satisfaction logic problem, and find evidence from both error rate comparisons and CoT/explanation text analysis that RLLMs are prone to hallucinate graph edges not specified in the prompt. This phenomenon persists across multiple problem complexity levels and semantic frames, and it appears to account for a significant fraction of the incorrect answers from every tested model, and the vast majority of them for some models. We also validate the generalizability of this input-conflicting hallucination phenomenon with smaller-scale experiments on a type of stable matching problem. Our results indicate that RLLMs may possess broader issues with misrepresentation of problem specifics, and we offer suggestions for design choices to mitigate this weakness.
comment: 19 pages (9 excluding references and appendices); 9 figures (6 excluding appendices)
♻ ☆ COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but this leads to substantial approximation errors, resulting in less effective optimization (e.g., in terms of per-token efficiency). In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS. Our code is available at https://github.com/lliu606/COSMOS.
comment: 23 pages, 9 figures, 6 tables
♻ ☆ From Federated Learning to X-Learning: Breaking the Barriers of Decentrality Through Random Walks
We provide our perspective on X-Learning (XL), a novel distributed learning architecture that generalizes and extends the concept of decentralization. Our goal is to present a vision for XL, introducing its unexplored design considerations and degrees of freedom. To this end, we shed light on the intuitive yet non-trivial connections between XL, graph theory, and Markov chains. We also present a series of open research directions to stimulate further research.
comment: 6 figures
♻ ☆ Partition Generative Modeling: Masked Modeling Without Masks
Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive models (AR) through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry no information, leading to wasted computation. In contrast, AR models process only tokens generated previously, making early iterations faster. In this work, we introduce the Partition Generative Model (PGM), a novel approach that combines the strengths of AR and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process the previously-generated tokens only during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least $5\times$ improvements in sampling latency and throughput, while producing samples with superior Generative Perplexity, compared to Masked Diffusion Language Models. On ImageNet, PGMs achieve a $7.5\times$ higher throughput than MaskGIT, with only a slight increase in FID (5.54 vs. 5.35). With twice as many sampling steps, the FID reduces to 4.56 while while being $3.9\times$ faster than MaskGIT. Finally, PGMs integrate seamlessly with MGM distillation, providing further inference speedups.
♻ ☆ A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis
Self-supervised learning is a machine learning approach that generates implicit labels by learning underlined patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of "positive" and "negative" samples, where positive pairs (e.g., variation of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther away. This methodology has shown significant improvements in image understanding and image text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments and applications of contrastive learning with respect to text-image models. Specifically, we provide an overview of the approaches of contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we further introduce and discuss the latest advances of the techniques used in the process such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss the recent state-of-art applications of self-supervised contrastive learning Text-Image based models.
comment: 38 pages, 8 figures, survey paper
♻ ☆ Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an \textit{adaptive memory} mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework was never used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.
♻ ☆ Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning
A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
♻ ☆ Lizard: An Efficient Linearization Framework for Large Language Models
We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers faces severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardwareaware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by up to 9.4 - 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
comment: 13 pages
♻ ☆ Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance
The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural network (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD aims to optimize the average of mini-batch partial-likelihood, which is different from the standard partial-likelihood. This distinction requires developing new statistical properties for the global optimizer, namely, the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal with asymptotic variance approaching the information lower bound as batch size increases, which is confirmed by simulation studies. Additionally, we offer practical guidance on using SGD, supported by theoretical analysis and numerical evidence. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.
♻ ☆ Revealing Interconnections between Diseases: from Statistical Methods to Large Language Models
Identifying disease interconnections through manual analysis of large-scale clinical data is labor-intensive, subjective, and prone to expert disagreement. While machine learning (ML) shows promise, three critical challenges remain: (1) selecting optimal methods from the vast ML landscape, (2) determining whether real-world clinical data (e.g., electronic health records, EHRs) or structured disease descriptions yield more reliable insights, (3) the lack of "ground truth," as some disease interconnections remain unexplored in medicine. Large language models (LLMs) demonstrate broad utility, yet they often lack specialized medical knowledge. To address these gaps, we conduct a systematic evaluation of seven approaches for uncovering disease relationships based on two data sources: (i) sequences of ICD-10 codes from MIMIC-IV EHRs and (ii) the full set of ICD-10 codes, both with and without textual descriptions. Our framework integrates the following: (i) a statistical co-occurrence analysis and a masked language modeling (MLM) approach using real clinical data; (ii) domain-specific BERT variants (Med-BERT and BioClinicalBERT); (iii) a general-purpose BERT and document retrieval; and (iv) four LLMs (Mistral, DeepSeek, Qwen, and YandexGPT). Our graph-based comparison of the obtained interconnection matrices shows that the LLM-based approach produces interconnections with the lowest diversity of ICD code connections to different diseases compared to other methods, including text-based and domain-based approaches. This suggests an important implication: LLMs have limited potential for discovering new interconnections. In the absence of ground truth databases for medical interconnections between ICD codes, our results constitute a valuable medical disease ontology that can serve as a foundational resource for future clinical research and artificial intelligence applications in healthcare.
Computer Vision and Pattern Recognition 169
☆ Temporal Prompting Matters: Rethinking Referring Video Object Segmentation
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts could be produced, they can not be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By taking such prompts to instruct image-based foundation segmentation models, we would be able to produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.
☆ Quantum-enhanced Computer Vision: Going Beyond Classical Algorithms
Quantum-enhanced Computer Vision (QeCV) is a new research field at the intersection of computer vision, optimisation theory, machine learning and quantum computing. It has high potential to transform how visual signals are processed and interpreted with the help of quantum computing that leverages quantum-mechanical effects in computations inaccessible to classical (i.e. non-quantum) computers. In scenarios where existing non-quantum methods cannot find a solution in a reasonable time or compute only approximate solutions, quantum computers can provide, among others, advantages in terms of better time scalability for multiple problem classes. Parametrised quantum circuits can also become, in the long term, a considerable alternative to classical neural networks in computer vision. However, specialised and fundamentally new algorithms must be developed to enable compatibility with quantum hardware and unveil the potential of quantum computational paradigms in computer vision. This survey contributes to the existing literature on QeCV with a holistic review of this research field. It is designed as a quantum computing reference for the computer vision community, targeting computer vision students, scientists and readers with related backgrounds who want to familiarise themselves with QeCV. We provide a comprehensive introduction to QeCV, its specifics, and methodologies for formulations compatible with quantum hardware and QeCV methods, leveraging two main quantum computational paradigms, i.e. gate-based quantum computing and quantum annealing. We elaborate on the operational principles of quantum computers and the available tools to access, program and simulate them in the context of QeCV. Finally, we review existing quantum computing tools and learning materials and discuss aspects related to publishing and reviewing QeCV papers, open challenges and potential social implications.
comment: 44 pages, 23 figures and 6 tables
Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers NeurIPS 2025
This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces \textit{flying pixels} at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
comment: NeurIPS 2025. Project page: https://pixel-perfect-depth.github.io/
☆ WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation
Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.
☆ MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.
comment: Project Page is available at: https://cvlab-kaist.github.io/MATRIX/
☆ SpecGuard: Spectral Projection-based Advanced Invisible Watermarking ICCV 2025
Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against various transformations primarily including distortions, image regeneration, and adversarial perturbation, creating real-world challenges. In this work, we introduce SpecGuard, a novel watermarking approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain using spectral projection of a higher frequency band that is decomposed by wavelet projection. Spectral projection employs Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval's theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard based on the embedded watermark's invisibility, capacity, and robustness. Comprehensive experiments demonstrate the proposed SpecGuard outperforms the state-of-the-art models. To ensure reproducibility, the full code is released on \href{https://github.com/inzamamulDU/SpecGuard_ICCV_2025}{\textcolor{blue}{\textbf{GitHub}}}.
comment: ICCV 2025 Accepted Paper
☆ Evaluating Fundus-Specific Foundation Models for Diabetic Macular Edema Detection
Diabetic Macular Edema (DME) is a leading cause of vision loss among patients with Diabetic Retinopathy (DR). While deep learning has shown promising results for automatically detecting this condition from fundus images, its application remains challenging due the limited availability of annotated data. Foundation Models (FM) have emerged as an alternative solution. However, it is unclear if they can cope with DME detection in particular. In this paper, we systematically compare different FM and standard transfer learning approaches for this task. Specifically, we compare the two most popular FM for retinal images--RETFound and FLAIR--and an EfficientNet-B0 backbone, across different training regimes and evaluation settings in IDRiD, MESSIDOR-2 and OCT-and-Eye-Fundus-Images (OEFI). Results show that despite their scale, FM do not consistently outperform fine-tuned CNNs in this task. In particular, an EfficientNet-B0 ranked first or second in terms of area under the ROC and precision/recall curves in most evaluation settings, with RETFound only showing promising results in OEFI. FLAIR, on the other hand, demonstrated competitive zero-shot performance, achieving notable AUC-PR scores when prompted appropriately. These findings reveal that FM might not be a good tool for fine-grained ophthalmic tasks such as DME detection even after fine-tuning, suggesting that lightweight CNNs remain strong baselines in data-scarce environments.
comment: Accepted for publication at SIPAIM 2025
☆ TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation
In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.
comment: Project page: https://talkcuts.github.io/
☆ GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation
Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods operate on fixed prompts and on noise or sample numbers, limiting their interpretability and adaptability. To solve these, we introduce a flexible and efficient test-time prompt optimization strategy that operates directly on the input text. We propose a plug-and-play multi-agent system called GenPilot, integrating error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. Our approach is model-agnostic, interpretable, and well-suited for handling long and complex prompts. Simultaneously, we summarize the common patterns of errors and the refinement strategy, offering more experience and encouraging further exploration. Experiments on DPG-bench and Geneval with improvements of up to 16.9% and 5.7% demonstrate the strong capability of our methods in enhancing the text and image consistency and structural coherence of generated images, revealing the effectiveness of our test-time prompt optimization strategy. The code is available at https://github.com/27yw/GenPilot.
comment: 30 pages, 21 figures, accepted to EMNLP 2025 findings
☆ EigenScore: OOD Detection using Covariance in Diffusion Models
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have recently emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection. We propose EigenScore, a new OOD detection method that leverages the eigenvalue spectrum of the posterior covariance induced by a diffusion model. We argue that posterior covariance provides a consistent signal of distribution shift, leading to larger trace and leading eigenvalues on OOD inputs, yielding a clear spectral signature. We further provide analysis explicitly linking posterior covariance to distribution mismatch, establishing it as a reliable signal for OOD detection. To ensure tractability, we adopt a Jacobian-free subspace iteration method to estimate the leading eigenvalues using only forward evaluations of the denoiser. Empirically, EigenScore achieves SOTA performance, with up to 5% AUROC improvement over the best baseline. Notably, it remains robust in near-OOD settings such as CIFAR-10 vs CIFAR-100, where existing diffusion-based methods often fail.
☆ Resolution scaling governs DINOv3 transfer performance in chest radiograph classification
Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta's DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.
☆ MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis SIGGRAPH
Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting camera trajectory within the front view while struggling to generate 360-degree viewpoint changes. In this paper, we focus on human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve a 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use the camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate our MV-Performer's state-of-the-art effectiveness and robustness, setting a strong model for human-centric 4D novel view synthesis.
comment: Accepted by SIGGRAPH Asia 2025 conference track
☆ TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
comment: 9 pages, 6 figures
☆ Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Recent endeavors to accelerate inference in Multimodal Large Language Models (MLLMs) have primarily focused on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks are originally designed to assess the perception and reasoning capabilities of MLLMs, rather than to evaluate compression techniques. As a result, directly applying them to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Down-sampling is able to serve as a data filter to evaluate the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
☆ Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models
Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs
☆ TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
comment: Project page: https://pku-epic.github.io/TrackVLA-plus-plus-Web/
☆ Graph Conditioned Diffusion for Controllable Histopathology Image Generation
Recent advances in Diffusion Probabilistic Models (DPMs) have set new standards in high-quality image synthesis. Yet, controlled generation remains challenging, particularly in sensitive areas such as medical imaging. Medical images feature inherent structure such as consistent spatial arrangement, shape or texture, all of which are critical for diagnosis. However, existing DPMs operate in noisy latent spaces that lack semantic structure and strong priors, making it difficult to ensure meaningful control over generated content. To address this, we propose graph-based object-level representations for Graph-Conditioned-Diffusion. Our approach generates graph nodes corresponding to each major structure in the image, encapsulating their individual features and relationships. These graph representations are processed by a transformer module and integrated into a diffusion model via the text-conditioning mechanism, enabling fine-grained control over generation. We evaluate this approach using a real-world histopathology use case, demonstrating that our generated data can reliably substitute for annotated patient data in downstream segmentation tasks. The code is available here.
☆ Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?
Deep learning (DL) has been increasingly applied in medical imaging, however, it requires large amounts of data, which raises many challenges related to data privacy, storage, and transfer. Federated learning (FL) is a training paradigm that overcomes these issues, though its effectiveness may be reduced when dealing with non-independent and identically distributed (non-IID) data. This study simulates non-IID conditions by applying different MRI intensity normalization techniques to separate data subsets, reflecting a common cause of heterogeneity. These subsets are then used for training and testing models for brain tumor segmentation. The findings provide insights into the influence of the MRI intensity normalization methods on segmentation models, both training and inference. Notably, the FL methods demonstrated resilience to inconsistently normalized data across clients, achieving the 3D Dice score of 92%, which is comparable to a centralized model (trained using all data). These results indicate that FL is a solution to effectively train high-performing models without violating data privacy, a crucial concern in medical applications. The code is available at: https://github.com/SanoScience/fl-varying-normalization.
☆ MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency
Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the estimated 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse view rendering scenarios.
☆ Enhancing Concept Localization in CLIP-based Concept Bottleneck Models
This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination, incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.
☆ DADO: A Depth-Attention framework for Object Discovery
Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism and a depth model to identify potential objects in images. To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.
comment: 21st International Conference in Computer Analysis of Images and Patterns (CAIP 2025)
☆ Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention. By unifying vision, language, and action data at scale, which have traditionally been studied separately, VLA models aim to learn policies that generalise across diverse tasks, objects, embodiments, and environments. This generalisation capability is expected to enable robots to solve novel downstream tasks with minimal or no additional task-specific data, facilitating more flexible and scalable real-world deployment. Unlike previous surveys that focus narrowly on action representations or high-level model architectures, this work offers a comprehensive, full-stack review, integrating both software and hardware components of VLA systems. In particular, this paper provides a systematic review of VLAs, covering their strategy and architectural transition, architectures and building blocks, modality-specific processing techniques, and learning paradigms. In addition, to support the deployment of VLAs in real-world robotic applications, we also review commonly used robot platforms, data collection strategies, publicly available datasets, data augmentation methods, and evaluation benchmarks. Throughout this comprehensive survey, this paper aims to offer practical guidance for the robotics community in applying VLAs to real-world robotic systems. All references categorized by training approach, evaluation method, modality, and dataset are available in the table on our project website: https://vla-survey.github.io .
comment: Accepted to IEEE Access, website: https://vla-survey.github.io
☆ Concept Retrieval -- What and How?
A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/
☆ Introspection in Learned Semantic Scene Graph Localisation
This work investigates how semantics influence localisation performance and robustness in a learned self-supervised, contrastive semantic localisation framework. After training a localisation network on both original and perturbed maps, we conduct a thorough post-hoc introspection analysis to probe whether the model filters environmental noise and prioritises distinctive landmarks over routine clutter. We validate various interpretability methods and present a comparative reliability analysis. Integrated gradients and Attention Weights consistently emerge as the most reliable probes of learned behaviour. A semantic class ablation further reveals an implicit weighting in which frequent objects are often down-weighted. Overall, the results indicate that the model learns noise-robust, semantically salient relations about place definition, thereby enabling explainable registration under challenging visual and structural variations.
comment: IEEE IROS 2025 Workshop FAST
☆ U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking
Over the past decade, U-Net has been the dominant architecture in medical image segmentation, leading to the development of thousands of U-shaped variants. Despite its widespread adoption, there is still no comprehensive benchmark to systematically evaluate their performance and utility, largely because of insufficient statistical validation and limited consideration of efficiency and generalization across diverse datasets. To bridge this gap, we present U-Bench, the first large-scale, statistically rigorous benchmark that evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. Our contributions are threefold: (1) Comprehensive Evaluation: U-Bench evaluates models along three key dimensions: statistical robustness, zero-shot generalization, and computational efficiency. We introduce a novel metric, U-Score, which jointly captures the performance-efficiency trade-off, offering a deployment-oriented perspective on model progress. (2) Systematic Analysis and Model Selection Guidance: We summarize key findings from the large-scale evaluation and systematically analyze the impact of dataset characteristics and architectural paradigms on model performance. Based on these insights, we propose a model advisor agent to guide researchers in selecting the most suitable models for specific datasets and tasks. (3) Public Availability: We provide all code, models, protocols, and weights, enabling the community to reproduce our results and extend the benchmark with future methods. In summary, U-Bench not only exposes gaps in previous evaluations but also establishes a foundation for fair, reproducible, and practically relevant benchmarking in the next decade of U-Net-based segmentation models. The project can be accessed at: https://fenghetan9.github.io/ubench. Code is available at: https://github.com/FengheTan9/U-Bench.
comment: 54 pages. The project can be accessed at: https://fenghetan9.github.io/ubench. Code is available at: https://github.com/FengheTan9/U-Bench
☆ Sharpness-Aware Data Generation for Zero-shot Quantization
Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing the full-precision model. While it is well-known that deep neural networks with low sharpness have better generalization ability, none of the previous zero-shot quantization works considers the sharpness of the quantized model as a criterion for generating training data. This paper introduces a novel methodology that takes into account quantized model sharpness in synthetic data generation to enhance generalization. Specifically, we first demonstrate that sharpness minimization can be attained by maximizing gradient matching between the reconstruction loss gradients computed on synthetic and real validation data, under certain assumptions. We then circumvent the problem of the gradient matching without real validation set by approximating it with the gradient matching between each generated sample and its neighbors. Experimental evaluations on CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed method over the state-of-the-art techniques in low-bit quantization settings.
☆ Bayesian Modelling of Multi-Year Crop Type Classification Using Deep Neural Networks and Hidden Markov Models
The temporal consistency of yearly land-cover maps is of great importance to model the evolution and change of the land cover over the years. In this paper, we focus the attention on a novel approach to classification of yearly satellite image time series (SITS) that combines deep learning with Bayesian modelling, using Hidden Markov Models (HMMs) integrated with Transformer Encoder (TE) based DNNs. The proposed approach aims to capture both i) intricate temporal correlations in yearly SITS and ii) specific patterns in multiyear crop type sequences. It leverages the cascade classification of an HMM layer built on top of the TE, discerning consistent yearly crop-type sequences. Validation on a multiyear crop type classification dataset spanning 47 crop types and six years of Sentinel-2 acquisitions demonstrates the importance of modelling temporal consistency in the predicted labels. HMMs enhance the overall performance and F1 scores, emphasising the effectiveness of the proposed approach.
comment: 5 pages, 1 figure, accepted conference paper at IEEE International Geoscience and Remote Sensing Symposium, 7-12 July 2024, Athens, Greece
☆ No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model's generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.
☆ Revisiting Mixout: An Overlooked Path to Robust Finetuning
Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the \emph{masking anchor}, \emph{resampling frequency}, and \emph{mask sparsity}. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. Experiments on benchmarks covering covariate shift, corruption, and class imbalance, ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C, GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
☆ Addressing the ID-Matching Challenge in Long Video Captioning
Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have, usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon LVLMs to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities within LVLMs themselves to enhance the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs containing GPT-4o, revealing key insights that the performance of ID-Matching can be improved through two methods: 1) enhancing the usage of image information and 2) increasing the quantity of information of individual descriptions. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, our RICE improves the precision of ID-Matching from 50% to 90% and improves the recall of ID-Matching from 15% to 80% compared to baseline. RICE makes it possible to continuously track different individuals in the captions of long videos.
☆ Learning Global Representation from Queries for Vectorized HD Map Construction
The online construction of vectorized high-definition (HD) maps is a cornerstone of modern autonomous driving systems. State-of-the-art approaches, particularly those based on the DETR framework, formulate this as an instance detection problem. However, their reliance on independent, learnable object queries results in a predominantly local query perspective, neglecting the inherent global representation within HD maps. In this work, we propose \textbf{MapGR} (\textbf{G}lobal \textbf{R}epresentation learning for HD \textbf{Map} construction), an architecture designed to learn and utilize a global representations from queries. Our method introduces two synergistic modules: a Global Representation Learning (GRL) module, which encourages the distribution of all queries to better align with the global map through a carefully designed holistic segmentation task, and a Global Representation Guidance (GRG) module, which endows each individual query with explicit, global-level contextual information to facilitate its optimization. Evaluations on the nuScenes and Argoverse2 datasets validate the efficacy of our approach, demonstrating substantial improvements in mean Average Precision (mAP) compared to leading baselines.
comment: 16 pages
☆ Generating Surface for Text-to-3D using 2D Gaussian Splatting
Recent advancements in Text-to-3D modeling have shown significant potential for the creation of 3D content. However, due to the complex geometric shapes of objects in the natural world, generating 3D content remains a challenging task. Current methods either leverage 2D diffusion priors to recover 3D geometry, or train the model directly based on specific 3D representations. In this paper, we propose a novel method named DirectGaussian, which focuses on generating the surfaces of 3D objects represented by surfels. In DirectGaussian, we utilize conditional text generation models and the surface of a 3D object is rendered by 2D Gaussian splatting with multi-view normal and texture priors. For multi-view geometric consistency problems, DirectGaussian incorporates curvature constraints on the generated surface during optimization process. Through extensive experiments, we demonstrate that our framework is capable of achieving diverse and high-fidelity 3D content creation.
☆ High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization WACV 2026
Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
comment: WACV 2026: Winter Conference on Applications of Computer Vision 2026
☆ OBJVanish: Physically Realizable Text-to-3D Adv. Generation of LiDAR-Invisible Objects
LiDAR-based 3D object detectors are fundamental to autonomous driving, where failing to detect objects poses severe safety risks. Developing effective 3D adversarial attacks is essential for thoroughly testing these detection systems and exposing their vulnerabilities before real-world deployment. However, existing adversarial attacks that add optimized perturbations to 3D points have two critical limitations: they rarely cause complete object disappearance and prove difficult to implement in physical environments. We introduce the text-to-3D adversarial generation method, a novel approach enabling physically realizable attacks that can generate 3D models of objects truly invisible to LiDAR detectors and be easily realized in the real world. Specifically, we present the first empirical study that systematically investigates the factors influencing detection vulnerability by manipulating the topology, connectivity, and intensity of individual pedestrian 3D models and combining pedestrians with multiple objects within the CARLA simulation environment. Building on the insights, we propose the physically-informed text-to-3D adversarial generation (Phy3DAdvGen) that systematically optimizes text prompts by iteratively refining verbs, objects, and poses to produce LiDAR-invisible pedestrians. To ensure physical realizability, we construct a comprehensive object pool containing 13 3D models of real objects and constrain Phy3DAdvGen to generate 3D objects based on combinations of objects in this set. Extensive experiments demonstrate that our approach can generate 3D pedestrians that evade six state-of-the-art (SOTA) LiDAR 3D detectors in both CARLA simulation and physical environments, thereby highlighting vulnerabilities in safety-critical applications.
☆ IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction
Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction-first the semantic token, then the detail token-while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.
☆ Label-frugal satellite image change detection with generative virtual exemplar learning
Change detection is a major task in remote sensing which consists in finding all the occurrences of changes in multi-temporal satellite or aerial images. The success of existing methods, and particularly deep learning ones, is tributary to the availability of hand-labeled training data that capture the acquisition conditions and the subjectivity of the user (oracle). In this paper, we devise a novel change detection algorithm, based on active learning. The main contribution of our work resides in a new model that measures how important is each unlabeled sample, and provides an oracle with only the most critical samples (also referred to as virtual exemplars) for further labeling. These exemplars are generated, using an invertible graph convnet, as the optimum of an adversarial loss that (i) measures representativity, diversity and ambiguity of the data, and thereby (ii) challenges (the most) the current change detection criteria, leading to a better re-estimate of these criteria in the subsequent iterations of active learning. Extensive experiments show the positive impact of our label-efficient learning model against comparative methods.
☆ Angular Constraint Embedding via SpherePair Loss for Constrained Clustering NeurIPS 2025
Constrained clustering integrates domain knowledge through pairwise constraints. However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at \href{https://github.com/spherepaircc/SpherePairCC/tree/main}{our repository}.
comment: Accepted by NeurIPS 2025, 6 Figures and 1 Table in Main text, 18 Figures and 5 Tables in Appendices
☆ Lung Infection Severity Prediction Using Transformers with Conditional TransMix Augmentation and Cross-Attention
Lung infections, particularly pneumonia, pose serious health risks that can escalate rapidly, especially during pandemics. Accurate AI-based severity prediction from medical imaging is essential to support timely clinical decisions and optimize patient outcomes. In this work, we present a novel method applicable to both CT scans and chest X-rays for assessing lung infection severity. Our contributions are twofold: (i) QCross-Att-PVT, a Transformer-based architecture that integrates parallel encoders, a cross-gated attention mechanism, and a feature aggregator to capture rich multi-scale features; and (ii) Conditional Online TransMix, a custom data augmentation strategy designed to address dataset imbalance by generating mixed-label image patches during training. Evaluated on two benchmark datasets, RALO CXR and Per-COVID-19 CT, our method consistently outperforms several state-of-the-art deep learning models. The results emphasize the critical role of data augmentation and gated attention in improving both robustness and predictive accuracy. This approach offers a reliable, adaptable tool to support clinical diagnosis, disease monitoring, and personalized treatment planning. The source code of this work is available at https://github.com/bouthainas/QCross-Att-PVT.
☆ HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation
LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24$\times$ faster. The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt
comment: Accepted at IROS 2025 (IEEE/RSJ International Conference on Intelligent Robots and Systems)
☆ SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the \textit{Reasoning Tax}. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance $70.13$ and $78.97$ on safety and helpfulness across six benchmarks, surpassing both same-scale and $>10\times$ larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by \num{6.47} and \num{16.76} points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.
☆ Explaining raw data complexity to improve satellite onboard processing
With increasing processing power, deploying AI models for remote sensing directly onboard satellites is becoming feasible. However, new constraints arise, mainly when using raw, unprocessed sensor data instead of preprocessed ground-based products. While current solutions primarily rely on preprocessed sensor images, few approaches directly leverage raw data. This study investigates the effects of utilising raw data on deep learning models for object detection and classification tasks. We introduce a simulation workflow to generate raw-like products from high-resolution L1 imagery, enabling systemic evaluation. Two object detection models (YOLOv11s and YOLOX-S) are trained on both raw and L1 datasets, and their performance is compared using standard detection metrics and explainability tools. Results indicate that while both models perform similarly at low to medium confidence thresholds, the model trained on raw data struggles with object boundary identification at high confidence levels. It suggests that adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.
comment: Preprint: European Data Handling & Data Processing Conference (EDHPC) 2025
☆ Online Generic Event Boundary Detection ICCV 2025
Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans processing data online and in real-time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), aiming to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real-time, without the access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST) which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA), and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on the Kinetics-GEBD and TAPOS datasets.
comment: ICCV 2025
☆ Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization ECCV 2024
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness. Our code is available at https://github.com/ZhouKanglei/MAGRPP.
comment: Extended Version of MAGR (ECCV 2024 Oral Presentation)
☆ Lattice-allocated Real-time Line Segment Feature Detection and Tracking Using Only an Event-based Camera ICCV
Line segment extraction is effective for capturing geometric features of human-made environments. Event-based cameras, which asynchronously respond to contrast changes along edges, enable efficient extraction by reducing redundant data. However, recent methods often rely on additional frame cameras or struggle with high event rates. This research addresses real-time line segment detection and tracking using only a modern, high-resolution (i.e., high event rate) event-based camera. Our lattice-allocated pipeline consists of (i) velocity-invariant event representation, (ii) line segment detection based on a fitting score, (iii) and line segment tracking by perturbating endpoints. Evaluation using ad-hoc recorded dataset and public datasets demonstrates real-time performance and higher accuracy compared to state-of-the-art event-only and event-frame hybrid baselines, enabling fully stand-alone event camera operation in real-world settings.
comment: 12 pages, 13 figures, 6 tables, ICCV Workshop NeVi2025
☆ StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance ICCV 2025
In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and propose 2) negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employs negative score by intentionally simulating content leakage scenarios that swap queries instead of key and values of self-attention layers from visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as visual style prompts. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references, and ensuring that resulting images match the text prompts. Our code is available \href{https://github.com/naver-ai/StyleKeeper}{here}.
comment: Accepted to ICCV 2025; CVPRW AI4CC 2024 (Best Paper + Oral)
Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision--language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image--text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
comment: preprint
☆ VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance
Echocardiography is a critical tool for detecting heart diseases. Recently, ultrasound foundation models have demonstrated remarkable capabilities in cardiac ultrasound image analysis. However, obtaining high-quality ultrasound images is a prerequisite for accurate diagnosis. Due to the exceptionally high operational difficulty of cardiac ultrasound, there is a shortage of highly skilled personnel, which hinders patients from receiving timely examination services. In this paper, we aim to adapt the medical knowledge learned by foundation models from vast datasets to the probe guidance task, which is designed to provide real-time operational recommendations for junior sonographers to acquire high-quality ultrasound images. Moreover, inspired by the practice where experts optimize action decisions based on past explorations, we meticulously design a parameter-efficient Vision-Action Adapter (VA-Adapter) to enable foundation model's image encoder to encode vision-action sequences, thereby enhancing guidance performance. With built-in sequential reasoning capabilities in a compact design, the VA-Adapter enables a pre-trained ultrasound foundation model to learn precise probe adjustment strategies by fine-tuning only a small subset of parameters. Extensive experiments demonstrate that the VA-Adapter can surpass strong probe guidance models. Our code will be released after acceptance.
☆ Capture and Interact: Rapid 3D Object Acquisition and Rendering with Gaussian Splatting in Unity
Capturing and rendering three-dimensional (3D) objects in real time remain a significant challenge, yet hold substantial potential for applications in augmented reality, digital twin systems, remote collaboration and prototyping. We present an end-to-end pipeline that leverages 3D Gaussian Splatting (3D GS) to enable rapid acquisition and interactive rendering of real-world objects using a mobile device, cloud processing and a local computer. Users scan an object with a smartphone video, upload it for automated 3D reconstruction, and visualize it interactively in Unity at an average of 150 frames per second (fps) on a laptop. The system integrates mobile capture, cloud-based 3D GS and Unity rendering to support real-time telepresence. Our experiments show that the pipeline processes scans in approximately 10 minutes on a graphics processing unit (GPU) achieving real-time rendering on the laptop.
☆ Extreme Amodal Face Detection
Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.
☆ Bionetta: Efficient Client-Side Zero-Knowledge Machine Learning Proving
In this report, we compare the performance of our UltraGroth-based zero-knowledge machine learning framework Bionetta to other tools of similar purpose such as EZKL, Lagrange's deep-prove, or zkml. The results show a significant boost in the proving time for custom-crafted neural networks: they can be proven even on mobile devices, enabling numerous client-side proving applications. While our scheme increases the cost of one-time preprocessing steps, such as circuit compilation and generating trusted setup, our approach is, to the best of our knowledge, the only one that is deployable on the native EVM smart contracts without overwhelming proof size and verification overheads.
☆ TTRV: Test-Time Reinforcement Learning for Vision Language Models
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets.Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
☆ GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
We present a quantitative evaluation to understand the effect of zero-shot large-language model (LLMs) and prompting uses on chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracies between the agentic GPT-5 and multimodal GPT-4V, for difficult image instances, where GPT-4V failed to produce correct answers. Our results show that model architecture dominates the inference accuracy: GPT5 largely improved accuracy, while prompt variants yielded only small effects. Pre-registration of this work is available here: https://osf.io/u78td/?view_only=6b075584311f48e991c39335c840ded3; the Google Drive materials are here:https://drive.google.com/file/d/1ll8WWZDf7cCNcfNWrLViWt8GwDNSvVrp/view.
☆ A deep multiple instance learning approach based on coarse labels for high-resolution land-cover mapping
The quantity and the quality of the training labels are central problems in high-resolution land-cover mapping with machine-learning-based solutions. In this context, weak labels can be gathered in large quantities by leveraging on existing low-resolution or obsolete products. In this paper, we address the problem of training land-cover classifiers using high-resolution imagery (e.g., Sentinel-2) and weak low-resolution reference data (e.g., MODIS -derived land-cover maps). Inspired by recent works in Deep Multiple Instance Learning (DMIL), we propose a method that trains pixel-level multi-class classifiers and predicts low-resolution labels (i.e., patch-level classification), where the actual high-resolution labels are learned implicitly without direct supervision. This is achieved with flexible pooling layers that are able to link the semantics of the pixels in the high-resolution imagery to the low-resolution reference labels. Then, the Multiple Instance Learning (MIL) problem is re-framed in a multi-class and in a multi-label setting. In the former, the low-resolution annotation represents the majority of the pixels in the patch. In the latter, the annotation only provides us information on the presence of one of the land-cover classes in the patch and thus multiple labels can be considered valid for a patch at a time, whereas the low-resolution labels provide us only one label. Therefore, the classifier is trained with a Positive-Unlabeled Learning (PUL) strategy. Experimental results on the 2020 IEEE GRSS Data Fusion Contest dataset show the effectiveness of the proposed framework compared to standard training strategies.
comment: 14 pages, 4 figures, accepted conference paper at SPIE REMOTE SENSING, 3-7 September 2023, Amsterdam, Netherlands
☆ Transforming Noise Distributions with Histogram Matching: Towards a Single Denoiser for All
Supervised Gaussian denoisers exhibit limited generalization when confronted with out-of-distribution noise, due to the diverse distributional characteristics of different noise types. To bridge this gap, we propose a histogram matching approach that transforms arbitrary noise towards a target Gaussian distribution with known intensity. Moreover, a mutually reinforcing cycle is established between noise transformation and subsequent denoising. This cycle progressively refines the noise to be converted, making it approximate the real noise, thereby enhancing the noise transformation effect and further improving the denoising performance. We tackle specific noise complexities: local histogram matching handles signal-dependent noise, intrapatch permutation processes channel-related noise, and frequency-domain histogram matching coupled with pixel-shuffle down-sampling breaks spatial correlation. By applying these transformations, a single Gaussian denoiser gains remarkable capability to handle various out-of-distribution noises, including synthetic noises such as Poisson, salt-and-pepper and repeating pattern noises, as well as complex real-world noises. Extensive experiments demonstrate the superior generalization and effectiveness of our method.
comment: 12 pages
☆ UniFField: A Generalizable Unified Neural Feature Field for Visual, Semantic, and Spatial Uncertainties in Any Scene
Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, it is necessary for the robot to evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimation. We evaluate our uncertainty estimations to accurately describe the model prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage our feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.
comment: Project website: https://sites.google.com/view/uniffield
☆ OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot
Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
☆ DeRainMamba: A Frequency-Aware State Space Model with Detail Enhancement for Image Deraining
Image deraining is crucial for improving visual quality and supporting reliable downstream vision tasks. Although Mamba-based models provide efficient sequence modeling, their limited ability to capture fine-grained details and lack of frequency-domain awareness restrict further improvements. To address these issues, we propose DeRainMamba, which integrates a Frequency-Aware State-Space Module (FASSM) and Multi-Directional Perception Convolution (MDPConv). FASSM leverages Fourier transform to distinguish rain streaks from high-frequency image details, balancing rain removal and detail preservation. MDPConv further restores local structures by capturing anisotropic gradient features and efficiently fusing multiple convolution branches. Extensive experiments on four public benchmarks demonstrate that DeRainMamba consistently outperforms state-of-the-art methods in PSNR and SSIM, while requiring fewer parameters and lower computational costs. These results validate the effectiveness of combining frequency-domain modeling and spatial detail enhancement within a state-space framework for single image deraining.
comment: accepted by IEEE SPL
☆ Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.
comment: The First Workshop on Natural Language Processing and Language Models for Digital Humanities (LM4DH 2025). RANLP 2025
☆ SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis
Persistent dynamic scene modeling for tracking and novel-view synthesis remains challenging due to the difficulty of capturing accurate deformations while maintaining computational efficiency. We propose SCas4D, a cascaded optimization framework that leverages structural patterns in 3D Gaussian Splatting for dynamic scenes. The key idea is that real-world deformations often exhibit hierarchical patterns, where groups of Gaussians share similar transformations. By progressively refining deformations from coarse part-level to fine point-level, SCas4D achieves convergence within 100 iterations per time frame and produces results comparable to existing methods with only one-twentieth of the training iterations. The approach also demonstrates effectiveness in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking tasks.
comment: Published in Transactions on Machine Learning Research (06/2025)
☆ Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion
Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, the first multimodal semantic segmentation dataset integrating light field data and point cloud data is proposed. Based on this dataset, we proposed a multi-modal light field point-cloud fusion segmentation network(Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point-cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image-only segmentation by 1.71 Mean Intersection over Union(mIoU) and point cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.
☆ DreamOmni2: Multimodal Instruction-based Editing and Generation
Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.
☆ Heptapod: Language Modeling on Visual Signals
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs \textbf{causal attention}, \textbf{eliminates reliance on CFG}, and \textbf{eschews the trend of semantic tokenizers}. Our key innovation is \textit{next 2D distribution prediction}: a causal Transformer with reconstruction-focused visual tokenizer, learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of $2.70$, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
☆ Automated Neural Architecture Design for Industrial Defect Detection
Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code will be available at https://github.com/Yuxi104/AutoNAD.
☆ The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators
A core challenge in scientific machine learning, and scientific computing more generally, is modeling continuous phenomena which (in practice) are represented discretely. Machine-learned operators (MLOs) have been introduced as a means to achieve this modeling goal, as this class of architecture can perform inference at arbitrary resolution. In this work, we evaluate whether this architectural innovation is sufficient to perform "zero-shot super-resolution," namely to enable a model to serve inference on higher-resolution data than that on which it was originally trained. We comprehensively evaluate both zero-shot sub-resolution and super-resolution (i.e., multi-resolution) inference in MLOs. We decouple multi-resolution inference into two key behaviors: 1) extrapolation to varying frequency information; and 2) interpolating across varying resolutions. We empirically demonstrate that MLOs fail to do both of these tasks in a zero-shot manner. Consequently, we find MLOs are not able to perform accurate inference at resolutions different from those on which they were trained, and instead they are brittle and susceptible to aliasing. To address these failure modes, we propose a simple, computationally-efficient, and data-driven multi-resolution training protocol that overcomes aliasing and that provides robust multi-resolution generalization.
☆ StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source, without external retrieval. Yet, MLLMs lack explicit reasoning supervision and produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces - dual symbolic relation paths plus path-grounded natural-language explanations - so that reasoning becomes transparent and verifiable. With one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.
☆ Control-Augmented Autoregressive Diffusion for Data Assimilation
Despite recent advances in test-time scaling and finetuning of diffusion models, guidance in Auto-Regressive Diffusion Models (ARDMs) remains underexplored. We introduce an amortized framework that augments pretrained ARDMs with a lightweight controller network, trained offline by previewing future ARDM rollouts and learning stepwise controls that anticipate upcoming observations under a terminal cost objective. We evaluate this framework in the context of data assimilation (DA) for chaotic spatiotemporal partial differential equations (PDEs), a setting where existing methods are often computationally prohibitive and prone to forecast drift under sparse observations. Our approach reduces DA inference to a single forward rollout with on-the-fly corrections, avoiding expensive adjoint computations and/or optimizations during inference. We demonstrate that our method consistently outperforms four state-of-the-art baselines in stability, accuracy, and physical fidelity across two canonical PDEs and six observation regimes. We will release code and checkpoints publicly.
☆ StruSR: Structure-Aware Symbolic Regression with Physics-Informed Taylor Guidance
Symbolic regression aims to find interpretable analytical expressions by searching over mathematical formula spaces to capture underlying system behavior, particularly in scientific modeling governed by physical laws. However, traditional methods lack mechanisms for extracting structured physical priors from time series observations, making it difficult to capture symbolic expressions that reflect the system's global behavior. In this work, we propose a structure-aware symbolic regression framework, called StruSR, that leverages trained Physics-Informed Neural Networks (PINNs) to extract locally structured physical priors from time series data. By performing local Taylor expansions on the outputs of the trained PINN, we obtain derivative-based structural information to guide symbolic expression evolution. To assess the importance of expression components, we introduce a masking-based attribution mechanism that quantifies each subtree's contribution to structural alignment and physical residual reduction. These sensitivity scores steer mutation and crossover operations within genetic programming, preserving substructures with high physical or structural significance while selectively modifying less informative components. A hybrid fitness function jointly minimizes physics residuals and Taylor coefficient mismatch, ensuring consistency with both the governing equations and the local analytical behavior encoded by the PINN. Experiments on benchmark PDE systems demonstrate that StruSR improves convergence speed, structural fidelity, and expression interpretability compared to conventional baselines, offering a principled paradigm for physics-grounded symbolic discovery.
☆ Unsupervised Backdoor Detection and Mitigation for Spiking Neural Networks
Spiking Neural Networks (SNNs) have gained increasing attention for their superior energy efficiency compared to Artificial Neural Networks (ANNs). However, their security aspects, particularly under backdoor attacks, have received limited attention. Existing defense methods developed for ANNs perform poorly or can be easily bypassed in SNNs due to their event-driven and temporal dependencies. This paper identifies the key blockers that hinder traditional backdoor defenses in SNNs and proposes an unsupervised post-training detection framework, Temporal Membrane Potential Backdoor Detection (TMPBD), to overcome these challenges. TMPBD leverages the maximum margin statistics of temporal membrane potential (TMP) in the final spiking layer to detect target labels without any attack knowledge or data access. We further introduce a robust mitigation mechanism, Neural Dendrites Suppression Backdoor Mitigation (NDSBM), which clamps dendritic connections between early convolutional layers to suppress malicious neurons while preserving benign behaviors, guided by TMP extracted from a small, clean, unlabeled dataset. Extensive experiments on multiple neuromorphic benchmarks and state-of-the-art input-aware dynamic trigger attacks demonstrate that TMPBD achieves 100% detection accuracy, while NDSBM reduces the attack success rate from 100% to 8.44%, and to 2.81% when combined with detection, without degrading clean accuracy.
comment: To appear in The 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2025)
☆ FEAorta: A Fully Automated Framework for Finite Element Analysis of the Aorta From 3D CT Images
Aortic aneurysm disease ranks consistently in the top 20 causes of death in the U.S. population. Thoracic aortic aneurysm is manifested as an abnormal bulging of thoracic aortic wall and it is a leading cause of death in adults. From the perspective of biomechanics, rupture occurs when the stress acting on the aortic wall exceeds the wall strength. Wall stress distribution can be obtained by computational biomechanical analyses, especially structural Finite Element Analysis. For risk assessment, probabilistic rupture risk of TAA can be calculated by comparing stress with material strength using a material failure model. Although these engineering tools are currently available for TAA rupture risk assessment on patient specific level, clinical adoption has been limited due to two major barriers: labor intensive 3D reconstruction current patient specific anatomical modeling still relies on manual segmentation, making it time consuming and difficult to scale to a large patient population, and computational burden traditional FEA simulations are resource intensive and incompatible with time sensitive clinical workflows. The second barrier was successfully overcome by our team through the development of the PyTorch FEA library and the FEA DNN integration framework. By incorporating the FEA functionalities within PyTorch FEA and applying the principle of static determinacy, we reduced the FEA based stress computation time to approximately three minutes per case. Moreover, by integrating DNN and FEA through the PyTorch FEA library, our approach further decreases the computation time to only a few seconds per case. This work focuses on overcoming the first barrier through the development of an end to end deep neural network capable of generating patient specific finite element meshes of the aorta directly from 3D CT images.
☆ MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking
Visual object tracking in real-world scenarios presents numerous challenges including occlusion, interference from similar objects and complex backgrounds-all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes-including interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes-spanning 55 object categories and 300 distinct natural scenes, MSITrack far exceeds the scope of existing benchmarks. Many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale-300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: https://github.com/Fengtao191/MSITrack.
☆ A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages
Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. The terrible performance is caused by the English-dominated training datasets and the lack of cross-language generalization abilities. Thus, we propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge audio and video modalities, achieving lifelike multilingual TFS. To alleviate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes respectively, which are the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages with 95.04 hours of high-quality videos for training and evaluating multilingual TFS performance. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and exhibits effective zero-shot generalization to unseen languages without additional training.
☆ Self-supervised Physics-guided Model with Implicit Representation Regularization for Fast MRI Reconstruction
Magnetic Resonance Imaging (MRI) is a vital clinical diagnostic tool, yet its widespread application is limited by prolonged scan times. Fast MRI reconstruction techniques effectively reduce acquisition duration by reconstructing high-fidelity MR images from undersampled k-space data. In recent years, deep learning-based methods have demonstrated remarkable progress in this field, with self-supervised and unsupervised learning approaches proving particularly valuable in scenarios where fully sampled data are difficult to obtain. This paper proposes a novel zero-shot self-supervised reconstruction framework named UnrollINR, which enables scan-specific MRI reconstruction without relying on external training data. The method adopts a physics-guided unrolled iterative reconstruction architecture and introduces Implicit Neural Representation (INR) as a regularization prior to effectively constrain the solution space. By combining a deep unrolled structure with the powerful implicit representation capability of INR, the model's interpretability and reconstruction performance are enhanced. Experimental results demonstrate that even at a high acceleration rate of 10, UnrollINR achieves superior reconstruction performance compared to the supervised learning method, validating the superiority of the proposed method.
☆ AIM 2025 Challenge on Real-World RAW Image Denoising
We introduce the AIM 2025 Real-World RAW Image Denoising Challenge, aiming to advance efficient and effective denoising techniques grounded in data synthesis. The competition is built upon a newly established evaluation benchmark featuring challenging low-light noisy images captured in the wild using five different DSLR cameras. Participants are tasked with developing novel noise synthesis pipelines, network architectures, and training methodologies to achieve high performance across different camera models. Winners are determined based on a combination of performance metrics, including full-reference measures (PSNR, SSIM, LPIPS), and non-reference ones (ARNIQA, TOPIQ). By pushing the boundaries of camera-agnostic low-light RAW image denoising trained on synthetic data, the competition promotes the development of robust and practical models aligned with the rapid progress in digital photography. We expect the competition outcomes to influence multiple domains, from image restoration to night-time autonomous driving.
☆ SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation
The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM
☆ Adaptive Stain Normalization for Cross-Domain Medical Histology MICCAI 2025
Deep learning advances have revolutionized automated digital pathology analysis. However, differences in staining protocols and imaging conditions can introduce significant color variability. In deep learning, such color inconsistency often reduces performance when deploying models on data acquired under different conditions from the training data, a challenge known as domain shift. Many existing methods attempt to address this problem via color normalization but suffer from several notable drawbacks such as introducing artifacts or requiring careful choice of a template image for stain mapping. To address these limitations, we propose a trainable color normalization model that can be integrated with any backbone network for downstream tasks such as object detection and classification. Based on the physics of the imaging process per the Beer-Lambert law, our model architecture is derived via algorithmic unrolling of a nonnegative matrix factorization (NMF) model to extract stain-invariant structural information from the original pathology images, which serves as input for further processing. Experimentally, we evaluate the method on publicly available pathology datasets and an internally curated collection of malaria blood smears for cross-domain object detection and classification, where our method outperforms many state-of-the-art stain normalization methods. Our code is available at https://github.com/xutianyue/BeerLaNet.
comment: Accepted to the 28th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2025)
☆ Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregrsssive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit community.
comment: Code released at https://github.com/inclusionAI/Ming-UniVision
☆ Improving Artifact Robustness for CT Deep Learning Models Without Labeled Artifact Images via Domain Adaptation
Deep learning models which perform well on images from their training distribution can degrade substantially when applied to new distributions. If a CT scanner introduces a new artifact not present in the training labels, the model may misclassify the images. Although modern CT scanners include design features which mitigate these artifacts, unanticipated or difficult-to-mitigate artifacts can still appear in practice. The direct solution of labeling images from this new distribution can be costly. As a more accessible alternative, this study evaluates domain adaptation as an approach for training models that maintain classification performance despite new artifacts, even without corresponding labels. We simulate ring artifacts from detector gain error in sinogram space and evaluate domain adversarial neural networks (DANN) against baseline and augmentation-based approaches on the OrganAMNIST abdominal CT dataset. Our results demonstrate that baseline models trained only on clean images fail to generalize to images with ring artifacts, and traditional augmentation with other distortion types provides no improvement on unseen artifact domains. In contrast, the DANN approach successfully maintains high classification accuracy on ring artifact images using only unlabeled artifact data during training, demonstrating the viability of domain adaptation for artifact robustness. The domain-adapted model achieved classification performance on ring artifact test data comparable to models explicitly trained with labeled artifact images, while also showing unexpected generalization to uniform noise. These findings provide empirical evidence that domain adaptation can effectively address distribution shift in medical imaging without requiring expensive expert labeling of new artifact distributions, suggesting promise for deployment in clinical settings where novel artifacts may emerge.
comment: 8 pages, 12 figures, 1 table
☆ Through the Perspective of LiDAR: A Feature-Enriched and Uncertainty-Aware Annotation Pipeline for Terrestrial Point Cloud Segmentation
Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi-automated, uncertainty-aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi-source features, and trains an ensemble of segmentation networks to produce pseudo-labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back-projected to 3D, yielding densely annotated point clouds supported by a three-tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine-channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature-enrichment strategy through cross-dataset tests on ForestSemantic and Semantic3D. Our contributions include: (i) a robust, uncertainty-aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high-quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at https://fz-rit.github.io/through-the-lidars-eye/.
☆ HSNet: Heterogeneous Subgraph Network for Single Image Super-resolution
Existing deep learning approaches for image super-resolution, particularly those based on CNNs and attention mechanisms, often suffer from structural inflexibility. Although graph-based methods offer greater representational adaptability, they are frequently impeded by excessive computational complexity. To overcome these limitations, this paper proposes the Heterogeneous Subgraph Network (HSNet), a novel framework that efficiently leverages graph modeling while maintaining computational feasibility. The core idea of HSNet is to decompose the global graph into manageable sub-components. First, we introduce the Constructive Subgraph Set Block (CSSB), which generates a diverse set of complementary subgraphs. Rather than relying on a single monolithic graph, CSSB captures heterogeneous characteristics of the image by modeling different relational patterns and feature interactions, producing a rich ensemble of both local and global graph structures. Subsequently, the Subgraph Aggregation Block (SAB) integrates the representations embedded across these subgraphs. Through adaptive weighting and fusion of multi-graph features, SAB constructs a comprehensive and discriminative representation that captures intricate interdependencies. Furthermore, a Node Sampling Strategy (NSS) is designed to selectively retain the most salient features, thereby enhancing accuracy while reducing computational overhead. Extensive experiments demonstrate that HSNet achieves state-of-the-art performance, effectively balancing reconstruction quality with computational efficiency. The code will be made publicly available.
☆ Cluster Paths: Navigating Interpretability in Neural Networks
While modern deep neural networks achieve impressive performance in vision tasks, they remain opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures. We propose cluster paths, a post-hoc interpretability method that clusters activations at selected layers and represents each input as its sequence of cluster IDs. To assess these cluster paths, we introduce four metrics: path complexity (cognitive load), weighted-path purity (class alignment), decision-alignment faithfulness (predictive fidelity), and path agreement (stability under perturbations). In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts and collapse when the cue is removed. On a five-class CelebA hair-color task, they achieve 90% faithfulness and maintain 96% agreement under Gaussian noise without sacrificing accuracy. Scaling to a Vision Transformer pretrained on ImageNet, we extend cluster paths to concept paths derived from prompting a large language model on minimal path divergences. Finally, we show that cluster paths can serve as an effective out-of-distribution (OOD) detector, reliably flagging anomalous samples before the model generates over-confident predictions. Cluster paths uncover visual concepts, such as color palettes, textures, or object contexts, at multiple network depths, demonstrating that cluster paths scale to large vision models while generating concise and human-readable explanations.
☆ VUGEN: Visual Understanding priors for GENeration
Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find that a VAE-free pixel diffusion decoder to be on par or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM's original understanding capabilities.
☆ Quick-CapsNet (QCN): A fast alternative to Capsule Networks
The basic computational unit in Capsule Network (CapsNet) is a capsule (vs. neurons in Convolutional Neural Networks (CNNs)). A capsule is a set of neurons, which form a vector. CapsNet is used for supervised classification of data and has achieved state-of-the-art accuracy on MNIST digit recognition dataset, outperforming conventional CNNs in detecting overlapping digits. Moreover, CapsNet shows higher robustness towards affine transformation when compared to CNNs for MNIST datasets. One of the drawbacks of CapsNet, however, is slow training and testing. This can be a bottleneck for applications that require a fast network, especially during inference. In this work, we introduce Quick-CapsNet (QCN) as a fast alternative to CapsNet, which can be a starting point to develop CapsNet for fast real-time applications. QCN builds on producing a fewer number of capsules, which results in a faster network. QCN achieves this at the cost of marginal loss in accuracy. Inference is 5x faster on MNIST, F-MNIST, SVHN and Cifar-10 datasets. We also further enhanced QCN by employing a more powerful decoder instead of the default decoder to further improve QCN.
☆ MaizeStandCounting (MaSC): Automated and Accurate Maize Stand Counting from UAV Imagery Using Image Processing and Deep Learning
Accurate maize stand counts are essential for crop management and research, informing yield prediction, planting density optimization, and early detection of germination issues. Manual counting is labor-intensive, slow, and error-prone, especially across large or variable fields. We present MaizeStandCounting (MaSC), a robust algorithm for automated maize seedling stand counting from RGB imagery captured by low-cost UAVs and processed on affordable hardware. MaSC operates in two modes: (1) mosaic images divided into patches, and (2) raw video frames aligned using homography matrices. Both modes use a lightweight YOLOv9 model trained to detect maize seedlings from V2-V10 growth stages. MaSC distinguishes maize from weeds and other vegetation, then performs row and range segmentation based on the spatial distribution of detections to produce precise row-wise stand counts. Evaluation against in-field manual counts from our 2024 summer nursery showed strong agreement with ground truth (R^2= 0.616 for mosaics, R^2 = 0.906 for raw frames). MaSC processed 83 full-resolution frames in 60.63 s, including inference and post-processing, highlighting its potential for real-time operation. These results demonstrate MaSC's effectiveness as a scalable, low-cost, and accurate tool for automated maize stand counting in both research and production environments.
comment: 10 pages, 11 figures. Submitted to IEEE Journal of Selected Topics in Signal Processing (JSTSP) Special Series on Artificial Intelligence for Smart Agriculture
☆ Cross-Modal Attention Guided Unlearning in Vision-Language Models
Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it in inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.
☆ Label Semantics for Robust Hyperspectral Image Classification
Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most of HSI classification models are monomodal, where it solely relies on spectral-spatial data to learn decision boundaries in the high dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that captures their unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics which in turn leads to a better feature-label alignment for improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets - Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit and report significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Codes are be available in: https://github.com/milab-nsu/S3FN
comment: This work has been accepted for publication in the proceedings of IJCNN 2025
☆ TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
☆ PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
☆ D2RA: Dual Domain Regeneration Attack
The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at https://github.com/Pragati-Meshram/DAWN.
☆ MLLM4TS: Leveraging Vision and Multimodal Language Models for General Time-Series Analysis
Effective analysis of time series data presents significant challenges due to the complex temporal dependencies and cross-channel interactions in multivariate data. Inspired by the way human analysts visually inspect time series to uncover hidden patterns, we ask: can incorporating visual representations enhance automated time-series analysis? Recent advances in multimodal large language models have demonstrated impressive generalization and visual understanding capability, yet their application to time series remains constrained by the modality gap between continuous numerical data and discrete natural language. To bridge this gap, we introduce MLLM4TS, a novel framework that leverages multimodal large language models for general time-series analysis by integrating a dedicated vision branch. Each time-series channel is rendered as a horizontally stacked color-coded line plot in one composite image to capture spatial dependencies across channels, and a temporal-aware visual patch alignment strategy then aligns visual patches with their corresponding time segments. MLLM4TS fuses fine-grained temporal details from the numerical data with global contextual information derived from the visual representation, providing a unified foundation for multimodal time-series analysis. Extensive experiments on standard benchmarks demonstrate the effectiveness of MLLM4TS across both predictive tasks (e.g., classification) and generative tasks (e.g., anomaly detection and forecasting). These results underscore the potential of integrating visual modalities with pretrained language models to achieve robust and generalizable time-series analysis.
☆ A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy
Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.
☆ Provably Accelerated Imaging with Restarted Inertia and Score-based Image Priors
Fast convergence and high-quality image recovery are two essential features of algorithms for solving ill-posed imaging inverse problems. Existing methods, such as regularization by denoising (RED), often focus on designing sophisticated image priors to improve reconstruction quality, while leaving convergence acceleration to heuristics. To bridge the gap, we propose Restarted Inertia with Score-based Priors (RISP) as a principled extension of RED. RISP incorporates a restarting inertia for fast convergence, while still allowing score-based image priors for high-quality reconstruction. We prove that RISP attains a faster stationary-point convergence rate than RED, without requiring the convexity of the image prior. We further derive and analyze the associated continuous-time dynamical system, offering insight into the connection between RISP and the heavy-ball ordinary differential equation (ODE). Experiments across a range of imaging inverse problems demonstrate that RISP enables fast convergence while achieving high-quality reconstructions.
comment: 62 pages
☆ DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis
Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) While the emphasis is on subject-centric prompts or static camera scenes, camera motion essential for producing cinematic shots and existing metrics under dynamic motion are largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlook video-level evaluation, which is vital to selecting the better video among the candidate videos generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain the interpretable error maps based on the Vbench motion smoothness metric. We observe that while the Vbench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct two failure cases in a principled manner. Our second innovation is the introduction of a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2% points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.
comment: Preprint. Under review. 26 pages, 11 figures, 11 tables. Access the project page in https://nithincbabu7.github.io/DynamicEval
☆ ConCuR: Conciseness Makes State-of-the-Art Kernel Generation
GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most high-quality kernels are proprietary and not open-source. This challenge prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces result in robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which is the first model trained on a curated dataset consisting of PyTorch, reasoning, and CUDA kernel pairs, to our knowledge. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks. The observations, metrics, and our data collection and curation pipeline can help obtain better data in the kernel generation task in the future.
♻ ☆ Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning
We present \emph{Video-in-the-Loop} (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first \emph{localizing} question-relevant interval(s) with a low-fps skim and then \emph{answering} via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce \dataname{}, which converts description based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
♻ ☆ HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping
Novel View Synthesis (NVS) from sparse views presents a formidable challenge in 3D reconstruction, where limited multi-view constraints lead to severe overfitting, geometric distortion, and fragmented scenes. While 3D Gaussian Splatting (3DGS) delivers real-time, high-fidelity rendering, its performance drastically deteriorates under sparse inputs, plagued by floating artifacts and structural failures. To address these challenges, we introduce HBSplat, a unified framework that elevates 3DGS by seamlessly integrating robust structural cues, virtual view constraints, and occluded region completion. Our core contributions are threefold: a Hybrid-Loss Depth Estimation module that ensures multi-view consistency by leveraging dense matching priors and integrating reprojection, point propagation, and smoothness constraints; a Bidirectional Warping Virtual View Synthesis method that enforces substantially stronger constraints by creating high-fidelity virtual views through bidirectional depth-image warping and multi-view fusion; and an Occlusion-Aware Reconstruction component that recovers occluded areas using a depth-difference mask and a learning-based inpainting model. Extensive evaluations on LLFF, Blender, and DTU benchmarks validate that HBSplat sets a new state-of-the-art, achieving up to 21.13 dB PSNR and 0.189 LPIPS, while maintaining real-time inference. Code is available at: https://github.com/eternalland/HBSplat.
comment: 14 pages, 21 figures
♻ ☆ LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning CVPR 2023
Our winning entry for the CVPR 2023 Generic Event Boundary Captioning (GEBC) competition is detailed in this paper. Unlike conventional video captioning tasks, GEBC demands that the captioning model possess an understanding of immediate changes in status around the designated video boundary, making it a difficult task. This paper proposes an effective model LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) We utilize a pretrained LLM for generating human-like captions with high quality. (2) To adapt the model to the GEBC task, we take the video Q-former as an adapter and train it with the frozen visual feature extractors and LLM. Our proposed method achieved a 76.14 score on the test set and won the first place in the challenge. Our code is available at https://github.com/zjr2000/LLMVA-GEBC .
comment: Winner solution to Generic Event Boundary Captioning task in LOVEU Challenge (CVPR 2023 workshop)
♻ ☆ V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning AAAI 2025
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective training of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
comment: Accepted to AAAI 2025
♻ ☆ Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding AAAI 2025
Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, their temporal understanding capacity in video-language tasks can be obtained. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
comment: Accepted to AAAI 2025
♻ ☆ CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion AAAI 2025
Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model, to enhance video saliency prediction. Specifically, we introduce a novel prompting method VSOR-CoT (Video Salient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that can be sufficiently leveraged by the diffusion model to decode the saliency maps for the given video accurately. Extensive experiments show the effectiveness of VSOR-CoT in improving the performance of video saliency prediction. The proposed CaRDiff performs better than state-of-the-art models on the MVS dataset and demonstrates cross-dataset capabilities on the DHF1k dataset through zero-shot evaluation.
comment: Accepted to AAAI 2025
♻ ☆ Is My Data in Your AI? Membership Inference Test (MINT) applied to Face Biometrics
This article introduces the Membership Inference Test (MINT), a novel approach that aims to empirically assess if given data was used during the training of AI/ML models. Specifically, we propose two MINT architectures designed to learn the distinct activation patterns that emerge when an Audited Model is exposed to data used during its training process. These architectures are based on Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). The experimental framework focuses on the challenging task of Face Recognition, considering three state-of-the-art Face Recognition systems. Experiments are carried out using six publicly available databases, comprising over 22 million face images in total. Different experimental scenarios are considered depending on the context of the AI model to test. Our proposed MINT approach achieves promising results, with up to 90\% accuracy, indicating the potential to recognize if an AI model has been trained with specific data. The proposed MINT approach can serve to enforce privacy and fairness in several AI applications, e.g., revealing if sensitive or private data was used for training or tuning Large Language Models (LLMs).
comment: 11 pages main text + 2 pages references and 1 pages appendix
♻ ☆ VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? CVPR 2025
The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at https://yunlong10.github.io/VidComposition/.
comment: Accepted to CVPR 2025
♻ ☆ Lossy Neural Compression for Geospatial Analytics: A Review
Over the past decades, there has been an explosion in the amount of available Earth Observation (EO) data. The unprecedented coverage of the Earth's surface and atmosphere by satellite imagery has resulted in large volumes of data that must be transmitted to ground stations, stored in data centers, and distributed to end users. Modern Earth System Models (ESMs) face similar challenges, operating at high spatial and temporal resolutions, producing petabytes of data per simulated day. Data compression has gained relevance over the past decade, with neural compression (NC) emerging from deep learning and information theory, making EO data and ESM outputs ideal candidates due to their abundance of unlabeled data. In this review, we outline recent developments in NC applied to geospatial data. We introduce the fundamental concepts of NC including seminal works in its traditional applications to image and video compression domains with focus on lossy compression. We discuss the unique characteristics of EO and ESM data, contrasting them with "natural images", and explain the additional challenges and opportunities they present. Moreover, we review current applications of NC across various EO modalities and explore the limited efforts in ESM compression to date. The advent of self-supervised learning (SSL) and foundation models (FM) has advanced methods to efficiently distill representations from vast unlabeled data. We connect these developments to NC for EO, highlighting the similarities between the two fields and elaborate on the potential of transferring compressed feature representations for machine--to--machine communication. Based on insights drawn from this review, we devise future directions relevant to applications in EO and ESM.
comment: self-consistent review paper
♻ ☆ MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness NeurIPS 2025
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
comment: Accepted to NeurIPS 2025 DB Track
♻ ☆ KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
comment: Work in progress
♻ ☆ RGS-DR: Deferred Reflections and Residual Shading in 2D Gaussian Splatting
In this work, we address specular appearance in inverse rendering using 2D Gaussian splatting with deferred shading and argue for a refinement stage to improve specular detail, thereby bridging the gap with reconstruction-only methods. Our pipeline estimates editable material properties and environment illumination while employing a directional residual pass that captures leftover view-dependent effects for further refining novel view synthesis. In contrast to per-Gaussian shading with shortest-axis normals and normal residuals, which tends to result in more noisy geometry and specular appearance, a pixel-deferred surfel formulation with specular residuals yields sharper highlights, cleaner materials, and improved editability. We evaluate our approach on rendering and reconstruction quality on three popular datasets featuring glossy objects, and also demonstrate high-quality relighting and material editing.
♻ ☆ Generative AI for Cel-Animation: A Survey ICCV 2025
Traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation
comment: Accepted by ICCV 2025 AISTORY Workshop
♻ ☆ AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, \emph{e.g.}, appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
comment: 8 pages, 6 figures
♻ ☆ Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward ACCV 2022
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers. It mainly contains two stages: video segmentation and segment assemblage. The existing method performs well at video segmentation stages but suffers from the problems of dependencies on extra cumbersome models and poor performance at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal Segment Assemblage Network) which can perform efficient and coherent segment assemblage task end-to-end. It utilizes multi-modal representation extracted from the segments and follows the Encoder-Decoder Ptr-Net framework with the Attention mechanism. Importance-coherence reward is designed for training M-SAN. We experiment on the Ads-1k dataset with 1000+ videos under rich ad scenarios collected from advertisers. To evaluate the methods, we propose a unified metric, Imp-Coh@Time, which comprehensively assesses the importance, coherence, and duration of the outputs at the same time. Experimental results show that our method achieves better performance than random selection and the previous method on the metric. Ablation experiments further verify that multi-modal representation and importance-coherence reward significantly improve the performance. Ads-1k dataset is available at: https://github.com/yunlong10/Ads-1k
comment: Accepted by ACCV 2022
♻ ☆ Video Understanding with Large Language Models: A Survey
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
comment: Accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
♻ ☆ BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring
Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align ``as-built" detected planes from the real-world environment with ``as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.
♻ ☆ SubGrapher: Visual Fingerprinting of Chemical Structures
Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code are publicly available.
♻ ☆ Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
comment: The 1st version
♻ ☆ Human Action Recognition from Point Clouds over Time
Recent research into human action recognition (HAR) has focused predominantly on skeletal action recognition and video-based methods. With the increasing availability of consumer-grade depth sensors and Lidar instruments, there is a growing opportunity to leverage dense 3D data for action recognition, to develop a third way. This paper presents a novel approach for recognizing actions from 3D videos by introducing a pipeline that segments human point clouds from the background of a scene, tracks individuals over time, and performs body part segmentation. The method supports point clouds from both depth sensors and monocular depth estimation. At the core of the proposed HAR framework is a novel backbone for 3D action recognition, which combines point-based techniques with sparse convolutional networks applied to voxel-mapped point cloud sequences. Experiments incorporate auxiliary point features including surface normals, color, infrared intensity, and body part parsing labels, to enhance recognition accuracy. Evaluation on the NTU RGB- D 120 dataset demonstrates that the method is competitive with existing skeletal action recognition algorithms. Moreover, combining both sensor-based and estimated depth inputs in an ensemble setup, this approach achieves 89.3% accuracy when different human subjects are considered for training and testing, outperforming previous point cloud action recognition methods.
♻ ☆ Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine AAAI2025
In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.
comment: Accepted by AAAI2025
♻ ☆ MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning
Learning object-level, structured representations is widely regarded as a key to better generalization in vision and underpins the design of next-generation Pre-trained Vision Models (PVMs). Mainstream Object-Centric Learning (OCL) methods adopt Slot Attention or its variants to iteratively aggregate objects' super-pixels into a fixed set of query feature vectors, termed slots. However, their reliance on a static slot count leads to an object being represented as multiple parts when the number of objects varies. We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. MetaSlot (i) maintains a codebook that holds prototypes of objects in a dataset by vector-quantizing the resulting slot representations; (ii) removes duplicate slots from the traditionally aggregated slots by quantizing them with the codebook; and (iii) injects progressively weaker noise into the Slot Attention iterations to accelerate and stabilize the aggregation. MetaSlot is a general Slot Attention variant that can be seamlessly integrated into existing OCL architectures. Across multiple public datasets and tasks--including object discovery and recognition--models equipped with MetaSlot achieve significant performance gains and markedly interpretable slot representations, compared with existing Slot Attention variants.
♻ ☆ PolyPose: Deformable 2D/3D Registration via Polyrigid Transformations NeurIPS 2025
Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise-rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient's preoperative volume to as few as two X-rays, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail. Additional visualizations, tutorials, and code are available at https://polypose.csail.mit.edu.
comment: NeurIPS 2025. Code available at https://github.com/eigenvivek/polypose
♻ ☆ Sustainable Self-evolution Adversarial Training
With the wide application of deep neural network models in various computer vision tasks, there has been a proliferation of adversarial example generation strategies aimed at deeply exploring model security. However, existing adversarial training defense models, which rely on single or limited types of attacks under a one-time learning process, struggle to adapt to the dynamic and evolving nature of attack methods. Therefore, to achieve defense performance improvements for models in long-term applications, we propose a novel Sustainable Self-Evolution Adversarial Training (SSEAT) framework. Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. Additionally, to address the issue of model catastrophic forgetting caused by continual learning from ongoing novel attacks, we propose an adversarial data replay module to better select more diverse and key relearning data. Furthermore, we design a consistency regularization strategy to encourage current defense models to learn more from previously trained ones, guiding them to retain more past knowledge and maintain accuracy on clean samples. Extensive experiments have been conducted to verify the efficacy of the proposed SSEAT defense method, which demonstrates superior defense performance and classification accuracy compared to competitors.Code is available at https://github.com/aup520/SSEAT
comment: Accepted to ACMMM 2024
♻ ☆ RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation NeurIPS 2025
The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.
comment: Accepted at NeurIPS 2025
♻ ☆ Maximising the Utility of Validation Sets for Imbalanced Noisy-label Meta-learning
Meta-learning is an effective method to handle imbalanced and noisy-label learning, but it depends on a validation set containing randomly selected, manually labelled and balanced distributed samples. The random selection and manual labelling and balancing of this validation set is not only sub-optimal for meta-learning, but it also scales poorly with the number of classes. Hence, recent meta-learning papers have proposed ad-hoc heuristics to automatically build and label this validation set, but these heuristics are still sub-optimal for meta-learning. In this paper, we analyse the meta-learning algorithm and propose new criteria to characterise the utility of the validation set, based on: 1) the informativeness of the validation set; 2) the class distribution balance of the set; and 3) the correctness of the labels of the set. Furthermore, we propose a new imbalanced noisy-label meta-learning (INOLML) algorithm that automatically builds a validation set by maximising its utility using the criteria above. Our method shows significant improvements over previous meta-learning approaches and sets the new state-of-the-art on several benchmarks.
♻ ☆ Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics
In the field of Image Quality Assessment (IQA), the adversarial robustness of the metrics poses a critical concern. This paper presents a comprehensive benchmarking study of various defense mechanisms in response to the rise in adversarial attacks on IQA. We systematically evaluate 25 defense strategies, including adversarial purification, adversarial training, and certified robustness methods. We applied 14 adversarial attack algorithms of various types in both non-adaptive and adaptive settings and tested these defenses against them. We analyze the differences between defenses and their applicability to IQA tasks, considering that they should preserve IQA scores and image quality. The proposed benchmark aims to guide future developments and accepts submissions of new methods, with the latest results available online: https://videoprocessing.ai/benchmarks/iqa-defenses.html.
♻ ☆ Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models NeurIPS
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
comment: The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/
♻ ☆ DiffMI: Breaking Face Recognition Privacy via Diffusion-Driven Training-Free Model Inversion
Face recognition poses serious privacy risks due to its reliance on sensitive and immutable biometric data. While modern systems mitigate privacy risks by mapping facial images to embeddings (commonly regarded as privacy-preserving), model inversion attacks reveal that identity information can still be recovered, exposing critical vulnerabilities. However, existing attacks are often computationally expensive and lack generalization, especially those requiring target-specific training. Even training-free approaches suffer from limited identity controllability, hindering faithful reconstruction of nuanced or unseen identities. In this work, we propose DiffMI, the first diffusion-driven, training-free model inversion attack. DiffMI introduces a novel pipeline combining robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective. DiffMI applies directly to unseen target identities and face recognition models, offering greater adaptability than training-dependent approaches while significantly reducing computational overhead. Our method achieves 84.42%--92.87% attack success rates against inversion-resilient systems and outperforms the best prior training-free GAN-based approach by 4.01%--9.82%. The implementation is available at https://github.com/azrealwang/DiffMI.
♻ ☆ GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm
Deep neural networks are highly vulnerable to adversarial examples that inputs with small, carefully crafted perturbations that cause misclassification, making adversarial attacks an essential tool for robustness evaluation. Existing black-box attacks fall into three categories: query-only, transfer-only, and query-and-transfer, and vary in perturbation pattern and optimization strategy. However, no prior method jointly achieves query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization, leaving a gap between black-box flexibility and white-box precision. We present GreedyPixel, a new attack framework that fills this gap by combining a surrogate-derived pixel priority map with greedy, per-pixel optimization refined by query feedback. This design reduces the exponential brute-force search space to a tractable linear procedure, guarantees monotonic loss decrease and convergence to a coordinate-wise optimum, and concentrates perturbations on robust, semantically meaningful pixels to improve perceptual quality. Extensive experiments on CIFAR-10 and ImageNet under both white-box and black-box settings demonstrate that GreedyPixel achieves state-of-the-art attack success rates and produces visually imperceptible perturbations. Our results show that GreedyPixel bridges the precision gap between white-box and black-box attacks and provides a practical framework for fine-grained robustness evaluation. The implementation is available at https://github.com/azrealwang/greedypixel.
♻ ☆ Color Bind: Exploring Color Perception in Text-to-Image Models
Text-to-image generation has recently seen remarkable success, granting users with the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct on a larger-scale. In this work, we perform a case study on colors -- a fundamental attribute commonly associated with objects in text prompts, which offer a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes-far more so than with single-color prompts-and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.
comment: Project webpage: https://tau-vailab.github.io/color-edit/
♻ ☆ Uncertainty-Aware Remaining Lifespan Prediction from Images
Predicting mortality-related outcomes from images offers the prospect of accessible, noninvasive, and scalable health screening. We present a method that leverages pretrained vision transformer foundation models to estimate remaining lifespan from facial and whole-body images, alongside robust uncertainty quantification. We show that predictive uncertainty varies systematically with the true remaining lifespan, and that this uncertainty can be effectively modeled by learning a Gaussian distribution for each sample. Our approach achieves state-of-the-art mean absolute error (MAE) of 7.41 years on an established dataset, and further achieves 4.91 and 4.99 years MAE on two new, higher-quality datasets curated and published in this work. Importantly, our models provide calibrated uncertainty estimates, as demonstrated by a bucketed expected calibration error of 0.82 years on the Faces Dataset. While not intended for clinical deployment, these results highlight the potential of extracting medically relevant signals from images. We make all code and datasets available to facilitate further research.
comment: Submitted to ISVC 2025
♻ ☆ Fully Spiking Neural Networks for Unified Frame-Event Object Tracking NeurIPS2025
The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking Frame-Event Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency.
comment: Accepted by NeurIPS2025
♻ ☆ Domain Generalization by Rejecting Extreme Augmentations WACV 2024
Data augmentation is one of the most effective techniques for regularizing deep learning models and improving their recognition performance in a variety of tasks and domains. However, this holds for standard in-domain settings, in which the training and test data follow the same distribution. For the out-of-domain case, where the test data follow a different and unknown distribution, the best recipe for data augmentation is unclear. In this paper, we show that for out-of-domain and domain generalization settings, data augmentation can provide a conspicuous and robust improvement in performance. To do that, we propose a simple training procedure: (i) use uniform sampling on standard data augmentation transformations; (ii) increase the strength transformations to account for the higher data variance expected when working out-of-domain, and (iii) devise a new reward function to reject extreme transformations that can harm the training. With this procedure, our data augmentation scheme achieves a level of accuracy that is comparable to or better than state-of-the-art methods on benchmark domain generalization datasets. Code: https://github.com/Masseeh/DCAug
comment: WACV 2024: Winter Conference on Applications of Computer Vision 2024
♻ ☆ Unified Unsupervised Anomaly Detection via Matching Cost Filtering
Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
comment: 63 pages (main paper and supplementary material), 39 figures, 58 tables
♻ ☆ Train-Free Segmentation in MRI with Cubical Persistent Homology
We present a new general framework for segmentation of MRI scans based on Topological Data Analysis (TDA), offering several advantages over traditional machine learning approaches. The pipeline proceeds in three steps, first identifying the whole object to segment via automatic thresholding, then detecting a distinctive subset whose topology is known in advance, and finally deducing the various components of the segmentation. Unlike most prior TDA uses in medical image segmentation, which are typically embedded within deep networks, our approach is a standalone method tailored to MRI. A key ingredient is the localization of representative cycles from the persistence diagram, which enables interpretable mappings from topological features to anatomical components. In particular, the method offers the ability to perform segmentation without the need for large annotated datasets. Its modular design makes it adaptable to a wide range of data segmentation challenges. We validate the framework on three applications: glioblastoma segmentation in brain MRI, where a sphere is to be detected; myocardium in cardiac MRI, forming a cylinder; and cortical plate detection in fetal brain MRI, whose 2D slices are circles. We compare our method with established supervised and unsupervised baselines.
comment: Preprint, 36 pages, 18 figures, 4 tables. For associated code, see https://github.com/antonfrancois/gliomaSegmentation_TDA
♻ ☆ Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion ICRA 2025
Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms failed to accurately generate the details of polyp boundary regions and typically required medical priors to specify plausible locations and shapes of polyps, which limited the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first full-automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate the state-of-the-art generation quality, and the synthetic images can improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zero-shot generalizability on other datasets. The source code is available at https://github.com/CUHK-AIM-Group/Polyp-Gen.
comment: Accepted by ICRA 2025
♻ ☆ Intelligent Healthcare Imaging Platform: A VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation
The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.
comment: 32 pages, 14 figures, 6 tables
♻ ☆ Unlocking Dataset Distillation with Diffusion Models
Dataset distillation seeks to condense datasets into smaller but highly representative synthetic samples. While diffusion models now lead all generative benchmarks, current distillation methods avoid them and rely instead on GANs or autoencoders, or, at best, sampling from a fixed diffusion prior. This trend arises because naive backpropagation through the long denoising chain leads to vanishing gradients, which prevents effective synthetic sample optimization. To address this limitation, we introduce Latent Dataset Distillation with Diffusion Models (LD3M), the first method to learn gradient-based distilled latents and class embeddings end-to-end through a pre-trained latent diffusion model. A linearly decaying skip connection, injected from the initial noisy state into every reverse step, preserves the gradient signal across dozens of timesteps without requiring diffusion weight fine-tuning. Across multiple ImageNet subsets at 128x128 and 256x256, LD3M improves downstream accuracy by up to 4.8 percentage points (1 IPC) and 4.2 points (10 IPC) over the prior state-of-the-art. The code for LD3M is provided at https://github.com/Brian-Moser/prune_and_distill.
♻ ☆ Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction
The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ
comment: Project Page: https://yanchi-3dv.github.io/PG-Occ
♻ ☆ Generative Pre-trained Autoregressive Diffusion Transformer
In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.
♻ ☆ IRNet: Iterative Refinement Network for Noisy Partial Label Learning
Partial label learning (PLL) is a typical weakly supervised learning, where each sample is associated with a set of candidate labels. Its basic assumption is that the ground-truth label must be in the candidate set, but this assumption may not be satisfied due to the unprofessional judgment of annotators. Therefore, we relax this assumption and focus on a more general task, noisy PLL, where the ground-truth label may not exist in the candidate set. To address this challenging task, we propose a novel framework called ``Iterative Refinement Network (IRNet)'', aiming to purify noisy samples through two key modules (i.e., noisy sample detection and label correction). To achieve better performance, we exploit smoothness constraints to reduce prediction errors in these modules. Through theoretical analysis, we prove that IRNet is able to reduce the noise level of the dataset and eventually approximate the Bayes optimal classifier. Meanwhile, IRNet is a plug-in strategy that can be integrated with existing PLL approaches. Experimental results on multiple benchmark datasets show that IRNet outperforms state-of-the-art approaches on noisy PLL. Our source code is available at: https://github.com/zeroQiaoba/IRNet.
♻ ☆ OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search
Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. This multi-view representation discrepancy of the same object in the query and the optimization objective collide across these stages, making it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.
comment: Some of the online experimental results in the paper are significantly different from the actual results, and need to be re-experimented and revised before submission. The current version is prone to misunderstanding
♻ ☆ Taming Diffusion Models for Image Restoration: A Review
Diffusion models have achieved remarkable progress in generative modelling, particularly in enhancing image quality to conform to human preferences. Recently, these models have also been applied to low-level computer vision for photo-realistic image restoration (IR) in tasks such as image denoising, deblurring, dehazing, etc. In this review paper, we introduce key constructions in diffusion models and survey contemporary techniques that make use of diffusion models in solving general IR tasks. Furthermore, we point out the main challenges and limitations of existing diffusion-based IR frameworks and provide potential directions for future work.
comment: This paper has been published in Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
♻ ☆ A Deep Learning System for Rapid and Accurate Warning of Acute Aortic Syndrome on Non-contrast CT in China
The accurate and timely diagnosis of acute aortic syndromes (AAS) in patients presenting with acute chest pain remains a clinical challenge. Aortic CT angiography (CTA) is the imaging protocol of choice in patients with suspected AAS. However, due to economic and workflow constraints in China, the majority of suspected patients initially undergo non-contrast CT as the initial imaging testing, and CTA is reserved for those at higher risk. In this work, we present an artificial intelligence-based warning system, iAorta, using non-contrast CT for AAS identification in China, which demonstrates remarkably high accuracy and provides clinicians with interpretable warnings. iAorta was evaluated through a comprehensive step-wise study. In the multi-center retrospective study (n = 20,750), iAorta achieved a mean area under the receiver operating curve (AUC) of 0.958 (95% CI 0.950-0.967). In the large-scale real-world study (n = 137,525), iAorta demonstrated consistently high performance across various non-contrast CT protocols, achieving a sensitivity of 0.913-0.942 and a specificity of 0.991-0.993. In the prospective comparative study (n = 13,846), iAorta demonstrated the capability to significantly shorten the time to correct diagnostic pathway. For the prospective pilot deployment that we conducted, iAorta correctly identified 21 out of 22 patients with AAS among 15,584 consecutive patients presenting with acute chest pain and under non-contrast CT protocol in the emergency department (ED) and enabled the average diagnostic time of these 21 AAS positive patients to be 102.1 (75-133) mins. Last, the iAorta can help avoid delayed or missed diagnosis of AAS in settings where non-contrast CT remains the unavoidable the initial or only imaging test in resource-constrained regions and in patients who cannot or did not receive intravenous contrast.
♻ ☆ GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning
Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPSMasked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. Code is released for further reference.
comment: 4 pages, 2 figures
♻ ☆ Gaze Estimation for Human-Robot Interaction: Analysis Using the NICO Platform
This paper evaluates the current gaze estimation methods within an HRI context of a shared workspace scenario. We introduce a new, annotated dataset collected with the NICO robotic platform. We evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed in terms of distance in the shared workspace the best median error is 16.48 cm quantifying the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how to best integrate gaze estimation as a modality in HRI systems.
comment: Code available at http://github.com/kocurvik/nico_gaze
♻ ☆ The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
Cognitive science research treats visual perception, the ability to understand and make sense of a visual input, as one of the early developmental signs of intelligence. Its TVPS-4 framework categorizes and tests human perception into seven skills such as visual discrimination, and form constancy. Do Multimodal Large Language Models (MLLMs) match up to humans in basic perception? Even though there are many benchmarks that evaluate MLLMs on advanced reasoning and knowledge skills, there is limited research that focuses evaluation on simple perception. In response, we introduce Percept-V, a dataset containing 6000 program-generated uncontaminated images divided into 30 domains, where each domain tests one or more TVPS-4 skills. Our focus is on perception, so we make our domains quite simple and the reasoning and knowledge required for solving them are minimal. Since modern-day MLLMs can solve much more complex tasks, our a-priori expectation is that they will solve these domains very easily. Contrary to our belief, our experiments show a weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V. We find that as number of objects in the image increases, performance goes down rather fast. Our experiments also identify the perception skills that are considerably harder for all models.
♻ ☆ Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA
Continual learning (CL) aims to accumulate knowledge from sequential tasks without catastrophic forgetting. Vision-language models such as CLIP, with strong generalization, are widely used for CL. Existing methods often adapt isolated PTM components, increasing inference complexity and limiting model improvement, or rely on replay, stored data, or assumptions, leading to high costs and limited applicability. To advance models as continual learners, we explore CL through natural and efficient PTM updates rather than complex task-specific additions. We study continual low-rank learning and analyze how LoRA ranks and placements affect learning and forgetting. A higher-rank LoRA improves task learning (plasticity) but increases forgetting, while a lower-rank LoRA enhances stability but limits adaptation. We observe a plasticity-stability balance tied to rank across parameters and tasks, with moderately small ranks maximizing CL benefits. Motivated by this, we propose Continual Dynamic Rank-Selective LoRA (CoDyRA), which continually updates PTMs with LoRA adapters of adaptively optimized ranks. The new-task objective drives learning, while sparsity-promoting regularization minimizes ranks to reduce interference and forgetting, achieving a balance tailored to each parameter and task. Although all parameters are updated, the minimized ranks keep the model close to its prior state while enabling effective new-task learning. CoDyRA performs efficient CL as a sequence of LoRA-based updates without storing past data or relying on assumptions, preserving the original model architecture and adding no inference overhead. Experiments show CoDyRA improves new representations while retaining old knowledge, achieving state-of-the-art results. Code is available at https://github.com/jeff024/codyra.
comment: Preprint
♻ ☆ acia-workflows: Automated Single-cell Imaging Analysis for Scalable and Deep Learning-based Live-cell Imaging Analysis Workflows
Live-cell imaging (LCI) technology enables the detailed spatio-temporal characterization of living cells at the single-cell level, which is critical for advancing research in the life sciences, from biomedical applications to bioprocessing. High-throughput setups with tens to hundreds of parallel cell cultivations offer the potential for robust and reproducible insights. However, these insights are obscured by the large amount of LCI data recorded per experiment. Recent advances in state-of-the-art deep learning methods for cell segmentation and tracking now enable the automated analysis of such large data volumes, offering unprecedented opportunities to systematically study single-cell dynamics. The next key challenge lies in integrating these powerful tools into accessible, flexible, and user-friendly workflows that support routine application in biological research. In this work, we present acia-workflows, a platform that combines three key components: (1) the Automated live-Cell Imaging Analysis (acia) Python library, which supports the modular design of image analysis pipelines offering eight deep learning segmentation and tracking approaches; (2) workflows that assemble the image analysis pipeline, its software dependencies, documentation, and visualizations into a single Jupyter Notebook, leading to accessible, reproducible and scalable analysis workflows; and (3) a collection of application workflows showcasing the analysis and customization capabilities in real-world applications. Specifically, we present three workflows to investigate various types of microfluidic LCI experiments ranging from growth rate comparisons to precise, minute-resolution quantitative analyses of individual dynamic cells responses to changing oxygen conditions. Our collection of more than ten application workflows is open source and publicly available at https://github.com/JuBiotech/acia-workflows.
♻ ☆ VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning NeurIPS 2025
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
comment: Accepted by NeurIPS 2025
♻ ☆ VGGT-X: When VGGT Meets Dense Novel View Synthesis
We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/
comment: Project Page: https://dekuliutesla.github.io/vggt-x.github.io/
♻ ☆ Efficient Universal Models for Medical Image Segmentation via Weakly Supervised In-Context Learning
Universal models for medical image segmentation, such as interactive and in-context learning (ICL) models, offer strong generalization but require extensive annotations. Interactive models need repeated user prompts for each image, while ICL relies on dense, pixel-level labels. To address this, we propose Weakly Supervised In-Context Learning (WS-ICL), a new ICL paradigm that leverages weak prompts (e.g., bounding boxes or points) instead of dense labels for context. This approach significantly reduces annotation effort by eliminating the need for fine-grained masks and repeated user prompting for all images. We evaluated the proposed WS-ICL model on three held-out benchmarks. Experimental results demonstrate that WS-ICL achieves performance comparable to regular ICL models at a significantly lower annotation cost. In addition, WS-ICL is highly competitive even under the interactive paradigm. These findings establish WS-ICL as a promising step toward more efficient and unified universal models for medical image segmentation. Our code and model are publicly available at https://github.com/jiesihu/Weak-ICL.
♻ ☆ DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting
Sparse-view 3D Gaussian Splatting (3DGS) presents significant challenges in reconstructing high-quality novel views, as it often overfits to the widely-varying high-frequency (HF) details of the sparse training views. While frequency regularization can be a promising approach, its typical reliance on Fourier transforms causes difficult parameter tuning and biases towards detrimental HF learning. We propose DWTGS, a framework that rethinks frequency regularization by leveraging wavelet-space losses that provide additional spatial supervision. Specifically, we supervise only the low-frequency (LF) LL subbands at multiple DWT levels, while enforcing sparsity on the HF HH subband in a self-supervised manner. Experiments across benchmarks show that DWTGS consistently outperforms Fourier-based counterparts, as this LF-centric strategy improves generalization and reduces HF hallucinations.
comment: Accepted to VCIP 2025
♻ ☆ Robot Learning from Any Images
We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources, including camera captures, robotic datasets, and Internet images. At its core, our approach combines a novel method for single-view physical scene recovery with an efficient visual blending strategy for photorealistic data collection. We demonstrate RoLA's versatility across applications like scalable robotic data generation and augmentation, robot learning from Internet images, and single-image real-to-sim-to-real systems for manipulators and humanoids. Video results are available at https://sihengz02.github.io/RoLA .
comment: CoRL 2025 camera ready
♻ ☆ TFM Dataset: A Novel Multi-task Dataset and Integrated Pipeline for Automated Tear Film Break-Up Segmentation
Tear film break-up (TFBU) analysis is critical for diagnosing dry eye syndrome, but automated TFBU segmentation remains challenging due to the lack of annotated datasets and integrated solutions. This paper introduces the Tear Film Multi-task (TFM) Dataset, the first comprehensive dataset for multi-task tear film analysis, comprising 15 high-resolution videos (totaling 6,247 frames) annotated with three vision tasks: frame-level classification ('clear', 'closed', 'broken', 'blur'), Placido Ring detection, and pixel-wise TFBU area segmentation. Leveraging this dataset, we first propose TF-Net, a novel and efficient baseline segmentation model. TF-Net incorporates a MobileOne-mini backbone with re-parameterization techniques and an enhanced feature pyramid network to achieve a favorable balance between accuracy and computational efficiency for real-time clinical applications. We further establish benchmark performance on the TFM segmentation subset by comparing TF-Net against several state-of-the-art medical image segmentation models. Furthermore, we design TF-Collab, a novel integrated real-time pipeline that synergistically leverages models trained on all three tasks of the TFM dataset. By sequentially orchestrating frame classification for BUT determination, pupil region localization for input standardization, and TFBU segmentation, TF-Collab fully automates the analysis. Experimental results demonstrate the effectiveness of the proposed TF-Net and TF-Collab, providing a foundation for future research in ocular surface diagnostics. Our code and the TFM datasets are available at https://github.com/glory-wan/TF-Net
♻ ☆ HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models NeurIPS 2025
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Our code is available at https://github.com/hrlics/HoPE.
comment: NeurIPS 2025
♻ ☆ SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models
Text-to-image models have shown remarkable capabilities in generating high-quality images from natural language descriptions. However, these models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. Despite various defensive strategies, achieving robustness against attacks while maintaining practical utility in real-world applications remains a significant challenge. To address this issue, we first conduct an empirical study of the text encoder in the Stable Diffusion (SD) model, which is a widely used and representative text-to-image model. Our findings reveal that the [EOS] token acts as a semantic aggregator, exhibiting distinct distributional patterns between benign and adversarial prompts in its embedding space. Building on this insight, we introduce \textbf{SafeGuider}, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider combines an embedding-level recognition model with a safety-aware feature erasure beam search algorithm. This integration enables the framework to maintain high-quality image generation for benign prompts while ensuring robust defense against both in-domain and out-of-domain attacks. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48\% across various attack scenarios. Moreover, instead of refusing to generate or producing black images for unsafe prompts, \textbf{SafeGuider} generates safe and meaningful images, enhancing its practical utility. In addition, SafeGuider is not limited to the SD model and can be effectively applied to other text-to-image models, such as the Flux model, demonstrating its versatility and adaptability across different architectures. We hope that SafeGuider can shed some light on the practical deployment of secure text-to-image systems.
comment: Accepted by ACM CCS 2025
♻ ☆ LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition
The self-supervised contrastive learning strategy has attracted considerable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a pure self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called local discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model's focus toward local pivotal regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the LoDisc pretext task can effectively enhance fine-grained clues in important local regions and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method can lead to a decent improvement in different evaluation settings. The proposed method is also effective for general object recognition tasks.
comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
♻ ☆ Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization
Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM's poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.
comment: 19pages, 5figures, 6tables
♻ ☆ MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
Multimodal large language models with Retrieval Augmented Generation (RAG) have significantly advanced tasks such as multimodal question answering by grounding responses in external text and images. This grounding improves factuality, reduces hallucination, and extends reasoning beyond parametric knowledge. However, this reliance on external knowledge poses a critical yet underexplored safety risk: knowledge poisoning attacks, where adversaries deliberately inject adversarial multimodal content into external knowledge bases to steer model toward generating incorrect or even harmful responses. To expose such vulnerabilities, we propose MM-PoisonRAG, the first framework to systematically design knowledge poisoning in multimodal RAG. We introduce two complementary attack strategies: Localized Poisoning Attack (LPA), which implants targeted multimodal misinformation to manipulate specific queries, and Globalized Poisoning Attack (GPA), which inserts a single adversarial knowledge to broadly disrupt reasoning and induce nonsensical responses across all queries. Comprehensive experiments across tasks, models, and access settings show that LPA achieves targeted manipulation with attack success rates of up to 56%, while GPA completely disrupts model generation to 0% accuracy with just a single adversarial knowledge injection. Our results reveal the fragility of multimodal RAG and highlight the urgent need for defenses against knowledge poisoning.
comment: Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG
♻ ☆ Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting NeurIPS 2025
3D reconstruction from in-the-wild images remains a challenging task due to inconsistent lighting conditions and transient distractors. Existing methods typically rely on heuristic strategies to handle the low-quality training data, which often struggle to produce stable and consistent reconstructions, frequently resulting in visual artifacts. In this work, we propose \modelname{}, a novel framework that leverages the stochastic nature of these artifacts: they tend to vary across different training runs due to minor randomness. Specifically, our method trains two 3D Gaussian Splatting (3DGS) models in parallel, enforcing a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To prevent the two models from collapsing into similar failure modes due to confirmation bias, we introduce a divergent masking strategy that applies two complementary masks: a multi-cue adaptive mask and a self-supervised soft mask, which leads to an asymmetric training process of the two models, reducing shared error modes. In addition, to improve the efficiency of model training, we introduce a lightweight variant called Dynamic EMA Proxy, which replaces one of the two models with a dynamically updated Exponential Moving Average (EMA) proxy, and employs an alternating masking strategy to preserve divergence. Extensive experiments on challenging real-world datasets demonstrate that our method consistently outperforms existing approaches while achieving high efficiency. See the project website at https://steveli88.github.io/AsymGS.
comment: NeurIPS 2025 Spotlight; Project page: https://steveli88.github.io/AsymGS/
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ Decomposed Global Optimization for Robust Point Matching with Low-Dimensional Branching
Numerous applications require algorithms that can align partially overlapping point sets while maintaining invariance to geometric transformations (e.g., similarity, affine, rigid). This paper introduces a novel global optimization method for this task by minimizing the objective function of the Robust Point Matching (RPM) algorithm. We first reveal that the original RPM objective is a cubic polynomial. Through a concise variable substitution, we transform this objective into a quadratic function. By leveraging the convex envelope of bilinear monomials, we derive a tight lower bound for this quadratic function. This lower bound problem conveniently and efficiently decomposes into two parts: a standard linear assignment problem (solvable in polynomial time) and a low-dimensional convex quadratic program. Furthermore, we devise a specialized Branch-and-Bound (BnB) algorithm that branches exclusively on the transformation parameters, which significantly accelerates convergence by confining the search space. Experiments on 2D and 3D synthetic and real-world data demonstrate that our method, compared to state-of-the-art approaches, exhibits superior robustness to non-rigid deformations, positional noise, and outliers, particularly in scenarios where outliers are distinct from inliers.
♻ ☆ Platonic Transformers: A Solid Choice For Equivariance
While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.
♻ ☆ Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout
We propose Orthogonal Monte Carlo Dropout, a mechanism that enforces strict orthogonality when combining sparse semantic vectors without extra time complexity. Low-Rank Adaptation (LoRA), a popular fine-tuning method for large models, typically trains a module to represent a specific concept such as an object or a style. When multiple LoRA modules are merged, for example to generate an object in a particular style, their outputs (semantic vectors) may interfere with each other. Our method guarantees that merged LoRA modules remain orthogonal and thus free from direct interference. However, empirical analysis reveals that such orthogonality does not lead to the semantic disentanglement highlighted in prior work on compositional adaptation. This finding suggests that inter-LoRA orthogonality alone may be insufficient for achieving true semantic compositionality, prompting a re-examination of its role in adapter merging.
♻ ☆ RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models NeurIPS 2025
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC
comment: Accepted to NeurIPS 2025
♻ ☆ Context Matters: Learning Global Semantics via Object-Centric Representation
Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper, we argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes, and such a gap can be narrowed through the design of a semantic-grounded objective. Specifically, we notice that individual words in natural language are inherently semantic, and modeling directly on word tokens naturally learns a realistic distribution. In contrast, ViTs rely on spatial patchification, which inevitably lacks semantic information. To bridge this gap, we propose to directly model "object" as the visual equivalence of "word," pushing the model to learn the global context and semantics among visual elements. We investigate our hypotheses via masked image modeling (MIM), a framework where our approach can be readily tested by applying masks to visual objects rather than random patches. Considerable evidence from qualitative and quantitative evaluations reveals a key finding: object-level representation alone helps to learn a real-world distribution, whereas pixel-averaging shortcuts are often learned without it. Moreover, further evaluations with multimodal LLMs (MLLM) on visual question answering (VQA, GQA, ScienceQA) tasks demonstrate the strong reasoning and contextual understanding gained with this simple objective. We hope our study highlights the effectiveness of object-level encoding and provides a plausible direction for developing stronger vision encoders and tokenizers. Code and model will be publicly released. Keywords: Semantic Visual Tokenizer, Vision Reasoning, In-context Learning, Multimodal Reasoning
♻ ☆ Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
The ability to infer what others are looking at is a critical component of a theory of mind that underpins natural human-AI interaction. We characterized this skill in 111 Vision Language Models (VLMs) and human participants (N = 65) using photos taken with manipulated difficulty and variability. We found that 94 of the 111 VLMs were not better than random guessing, while humans achieved near-ceiling accuracy. VLMs respond with each choice almost equally frequently. Are they randomly guessing? At least for five top-tier VLMs, their performance was above chance, declined with increasing task difficulty, but barely varied across different prompts and scene objects. These behavioral patterns cannot be explained by considering VLMs as random guessers. Instead, they likely utilize head orientation but not eye appearance to infer gaze direction, such that their performance is imperfect, subject to the task difficulty, but robust to superficial perceptual variations. This suggests that VLMs, lacking effective gaze inference skills, have yet to become technologies that can naturally interact with humans, but the potential remains.
comment: Manuscript under review. Project page at https://grow-ai-like-a-child.github.io/gaze/
♻ ☆ Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes ICCV 2025
Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.
comment: ICCV 2025
♻ ☆ Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining
The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., outweighing language prior biases over visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. However, existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
♻ ☆ MeanSparse: Post-Training Robustness Enhancement Through Mean-Centered Feature Sparsification
We present a simple yet effective method to improve the robustness of both Convolutional and attention-based Neural Networks against adversarial examples by post-processing an adversarially trained model. Our technique, MeanSparse, cascades the activation functions of a trained model with novel operators that sparsify mean-centered feature vectors. This is equivalent to reducing feature variations around the mean, and we show that such reduced variations merely affect the model's utility, yet they strongly attenuate the adversarial perturbations and decrease the attacker's success rate. Our experiments show that, when applied to the top models in the RobustBench leaderboard, MeanSparse achieves a new robustness record of 75.28% (from 73.71%), 44.78% (from 42.67%) and 62.12% (from 59.56%) on CIFAR-10, CIFAR-100 and ImageNet, respectively, in terms of AutoAttack accuracy. Code is available at https://github.com/SPIN-UMass/MeanSparse
♻ ☆ CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation
Instance segmentation demands costly per-pixel annotations and computationally expensive models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFM) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to CAST is an \emph{instance-aware pixel-wise contrastive loss} that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11x smaller student improves over its zero-shot VFM teacher(s) by +8.5 and +7.1 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and further outperforms state-of-the-art SSKD methods on both benchmarks.
♻ ☆ SiLVR: Scalable Lidar-Visual Radiance Field Reconstruction with Uncertainty Quantification
We present a neural radiance field (NeRF) based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photorealistic texture. Our system adopts the state-of-the-art NeRF representation to incorporate lidar. Adding lidar data adds strong geometric constraints on the depth and surface normals, which is particularly useful when modelling uniform texture surfaces which contain ambiguous visual reconstruction cues. A key contribution of this work is a novel method to quantify the epistemic uncertainty of the lidar-visual NeRF reconstruction by estimating the spatial variance of each point location in the radiance field given the sensor observations from the cameras and lidar. This provides a principled approach to evaluate the contribution of each sensor modality to the final reconstruction. In this way, reconstructions that are uncertain (due to e.g. uniform visual texture, limited observation viewpoints, or little lidar coverage) can be identified and removed. Our system is integrated with a real-time lidar SLAM system which is used to bootstrap a Structure-from-Motion (SfM) reconstruction procedure. It also helps to properly constrain the overall metric scale which is essential for the lidar depth loss. The refined SLAM trajectory can then be divided into submaps using Spectral Clustering to group sets of co-visible images together. This submapping approach is more suitable for visual reconstruction than distance-based partitioning. Our uncertainty estimation is particularly effective when merging submaps as their boundaries often contain artefacts due to limited observations. We demonstrate the reconstruction system using a multi-camera, lidar sensor suite in experiments involving both robot-mounted and handheld scanning. Our test datasets cover a total area of more than 20,000 square metres.
comment: Accepted by T-RO. Webpage: https://dynamic.robots.ox.ac.uk/projects/silvr/
♻ ☆ Redundant Semantic Environment Filling via Misleading-Learning for Fair Deepfake Detection
Detecting falsified faces generated by Deepfake technology is essential for safeguarding trust in digital communication and protecting individuals. However, current detectors often suffer from a dual-overfitting: they become overly specialized in both specific forgery fingerprints and particular demographic attributes. Critically, most existing methods overlook the latter issue, which results in poor fairness: faces from certain demographic groups, such as different genders or ethnicities, are consequently more difficult to reliably detect. To address this challenge, we propose a novel strategy called misleading-learning, which populates the latent space with a multitude of redundant environments. By exposing the detector to a sufficiently rich and balanced variety of high-level information for demographic fairness, our approach mitigates demographic bias while maintaining a high detection performance level. We conduct extensive evaluations on fairness, intra-domain detection, cross-domain generalization, and robustness. Experimental results demonstrate that our framework achieves superior fairness and generalization compared to state-of-the-art approaches.
comment: 19 pages, 9 figures
♻ ☆ MORPH: Shape-agnostic PDE Foundation Models
We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D--3D) at different resolutions, multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.
Machine Learning 267
☆ Artificial Hippocampus Networks for Efficient Long-Context Modeling
Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer's KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: https://github.com/ByteDance-Seed/AHN.
comment: Code: https://github.com/ByteDance-Seed/AHN
☆ Vibe Checker: Aligning Code Evaluation with Human Preference
Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
comment: Preprint
☆ h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to 2.06x. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. h1 therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.
comment: Preprint, 31 pages, 8 figures
☆ MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline
While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data is significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks, demanding extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline, to transform raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm for scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The proposed multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 of real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith can work effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of the MLE-Smith to scaling up MLE tasks, while maintaining task quality.
☆ Cocoon: A System Architecture for Differentially Private Training with Correlated Noises
Machine learning (ML) models memorize and leak training data, causing serious privacy issues to data owners. Training algorithms with differential privacy (DP), such as DP-SGD, have been gaining attention as a solution. However, DP-SGD adds a noise at each training iteration, which degrades the accuracy of the trained model. To improve accuracy, a new family of approaches adds carefully designed correlated noises, so that noises cancel out each other across iterations. We performed an extensive characterization study of these new mechanisms, for the first time to the best of our knowledge, and show they incur non-negligible overheads when the model is large or uses large embedding tables. Motivated by the analysis, we propose Cocoon, a hardware-software co-designed framework for efficient training with correlated noises. Cocoon accelerates models with embedding tables through pre-computing and storing correlated noises in a coalesced format (Cocoon-Emb), and supports large models through a custom near-memory processing device (Cocoon-NMP). On a real system with an FPGA-based NMP device prototype, Cocoon improves the performance by 2.33-10.82x(Cocoon-Emb) and 1.55-3.06x (Cocoon-NMP).
☆ On the Convergence of Moral Self-Correction in Large Language Models
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
comment: 19pages, 7 figures
☆ MolGA: Molecular Graph Adaptation with Pre-trained 2D Graph Encoder
Molecular graph representation learning is widely used in chemical and biomedical research. While pre-trained 2D graph encoders have demonstrated strong performance, they overlook the rich molecular domain knowledge associated with submolecular instances (atoms and bonds). While molecular pre-training approaches incorporate such knowledge into their pre-training objectives, they typically employ designs tailored to a specific type of knowledge, lacking the flexibility to integrate diverse knowledge present in molecules. Hence, reusing widely available and well-validated pre-trained 2D encoders, while incorporating molecular domain knowledge during downstream adaptation, offers a more practical alternative. In this work, we propose MolGA, which adapts pre-trained 2D graph encoders to downstream molecular applications by flexibly incorporating diverse molecular domain knowledge. First, we propose a molecular alignment strategy that bridge the gap between pre-trained topological representations with domain-knowledge representations. Second, we introduce a conditional adaptation mechanism that generates instance-specific tokens to enable fine-grained integration of molecular domain knowledge for downstream tasks. Finally, we conduct extensive experiments on eleven public datasets, demonstrating the effectiveness of MolGA.
comment: Under review
☆ Evolutionary Profiles for Protein Fitness Prediction
Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The codes will be made publicly available at https://github.com/aim-uofa/EvoIF.
☆ GTCN-G: A Residual Graph-Temporal Fusion Network for Imbalanced Intrusion Detection (Preprint)
The escalating complexity of network threats and the inherent class imbalance in traffic data present formidable challenges for modern Intrusion Detection Systems (IDS). While Graph Neural Networks (GNNs) excel in modeling topological structures and Temporal Convolutional Networks (TCNs) are proficient in capturing time-series dependencies, a framework that synergistically integrates both while explicitly addressing data imbalance remains an open challenge. This paper introduces a novel deep learning framework, named Gated Temporal Convolutional Network and Graph (GTCN-G), engineered to overcome these limitations. Our model uniquely fuses a Gated TCN (G-TCN) for extracting hierarchical temporal features from network flows with a Graph Convolutional Network (GCN) designed to learn from the underlying graph structure. The core innovation lies in the integration of a residual learning mechanism, implemented via a Graph Attention Network (GAT). This mechanism preserves original feature information through residual connections, which is critical for mitigating the class imbalance problem and enhancing detection sensitivity for rare malicious activities (minority classes). We conducted extensive experiments on two public benchmark datasets, UNSW-NB15 and ToN-IoT, to validate our approach. The empirical results demonstrate that the proposed GTCN-G model achieves state-of-the-art performance, significantly outperforming existing baseline models in both binary and multi-class classification tasks.
comment: This preprint was submitted to IEEE TrustCom 2025. The accepted version will be published under copyright 2025 IEEE
☆ Online Rubrics Elicitation from Pairwise Comparisons
Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
☆ Dynamic Regret Bounds for Online Omniprediction with Long Term Constraints
We present an algorithm guaranteeing dynamic regret bounds for online omniprediction with long term constraints. The goal in this recently introduced problem is for a learner to generate a sequence of predictions which are broadcast to a collection of downstream decision makers. Each decision maker has their own utility function, as well as a vector of constraint functions, each mapping their actions and an adversarially selected state to reward or constraint violation terms. The downstream decision makers select actions "as if" the state predictions are correct, and the goal of the learner is to produce predictions such that all downstream decision makers choose actions that give them worst-case utility guarantees while minimizing worst-case constraint violation. Within this framework, we give the first algorithm that obtains simultaneous \emph{dynamic regret} guarantees for all of the agents -- where regret for each agent is measured against a potentially changing sequence of actions across rounds of interaction, while also ensuring vanishing constraint violation for each agent. Our results do not require the agents themselves to maintain any state -- they only solve one-round constrained optimization problems defined by the prediction made at that round.
☆ Test-Time Graph Search for Goal-Conditioned Reinforcement Learning
Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning approach to solve the GCRL task. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference. On the OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.
☆ Discriminative Feature Feedback with General Teacher Classes
We study the theoretical properties of the interactive learning protocol Discriminative Feature Feedback (DFF) (Dasgupta et al., 2018). The DFF learning protocol uses feedback in the form of discriminative feature explanations. We provide the first systematic study of DFF in a general framework that is comparable to that of classical protocols such as supervised learning and online learning. We study the optimal mistake bound of DFF in the realizable and the non-realizable settings, and obtain novel structural results, as well as insights into the differences between Online Learning and settings with richer feedback such as DFF. We characterize the mistake bound in the realizable setting using a new notion of dimension. In the non-realizable setting, we provide a mistake upper bound and show that it cannot be improved in general. Our results show that unlike Online Learning, in DFF the realizable dimension is insufficient to characterize the optimal non-realizable mistake bound or the existence of no-regret algorithms.
☆ Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
comment: 20 pages
☆ A Broader View of Thompson Sampling
Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to "properly" balance exploration and exploitation, remains a mystery. In this paper we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, a key conceptual tool is introduced, which we refer to as "faithful" stationarization of the regret formulation. Essentially, the finite horizon dynamic optimization problem is converted into a stationary counterpart which "closely resembles" the original objective (in contrast, the classical infinite horizon discounted formulation, that leads to the Gittins index, alters the problem and objective in too significant a manner). The newly crafted time invariant objective can be studied using Bellman's principle which leads to a time invariant optimal policy. When viewed through this lens, Thompson Sampling admits a simple online optimization form that mimics the structure of the Bellman-optimal policy, and where greediness is regularized by a measure of residual uncertainty based on point-biserial correlation. This answers the question of how Thompson Sampling balances exploration-exploitation, and moreover, provides a principled framework to study and further improve Thompson's original idea.
Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts
Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or only top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, where the router's learning process is ``guided'' by the experts, that recovers the teacher's parameters. Moreover, we show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. To our knowledge, our analysis is the first to bring novel insights in understanding the optimization landscape of the MoE architecture.
☆ An in-depth look at approximation via deep and narrow neural networks
In 2017, Hanin and Sellke showed that the class of arbitrarily deep, real-valued, feed-forward and ReLU-activated networks of width w forms a dense subset of the space of continuous functions on R^n, with respect to the topology of uniform convergence on compact sets, if and only if w>n holds. To show the necessity, a concrete counterexample function f:R^n->R was used. In this note we actually approximate this very f by neural networks in the two cases w=n and w=n+1 around the aforementioned threshold. We study how the approximation quality behaves if we vary the depth and what effect (spoiler alert: dying neurons) cause that behavior.
comment: 11 pages
☆ Accelerating Inference for Multilayer Neural Networks with Quantum Computers
Fault-tolerant Quantum Processing Units (QPUs) promise to deliver exponential speed-ups in select computational tasks, yet their integration into modern deep learning pipelines remains unclear. In this work, we take a step towards bridging this gap by presenting the first fully-coherent quantum implementation of a multilayer neural network with non-linear activation functions. Our constructions mirror widely used deep learning architectures based on ResNet, and consist of residual blocks with multi-filter 2D convolutions, sigmoid activations, skip-connections, and layer normalizations. We analyse the complexity of inference for networks under three quantum data access regimes. Without any assumptions, we establish a quadratic speedup over classical methods for shallow bilinear-style networks. With efficient quantum access to the weights, we obtain a quartic speedup over classical methods. With efficient quantum access to both the inputs and the network weights, we prove that a network with an $N$-dimensional vectorized input, $k$ residual block layers, and a final residual-linear-pooling layer can be implemented with an error of $\epsilon$ with $O(\text{polylog}(N/\epsilon)^k)$ inference cost.
☆ Covert Quantum Learning: Privately and Verifiably Learning from Quantum Data
Quantum learning from remotely accessed quantum compute and data must address two key challenges: verifying the correctness of data and ensuring the privacy of the learner's data-collection strategies and resulting conclusions. The covert (verifiable) learning model of Canetti and Karchmer (TCC 2021) provides a framework for endowing classical learning algorithms with such guarantees. In this work, we propose models of covert verifiable learning in quantum learning theory and realize them without computational hardness assumptions for remote data access scenarios motivated by established quantum data advantages. We consider two privacy notions: (i) strategy-covertness, where the eavesdropper does not gain information about the learner's strategy; and (ii) target-covertness, where the eavesdropper does not gain information about the unknown object being learned. We show: Strategy-covert algorithms for making quantum statistical queries via classical shadows; Target-covert algorithms for learning quadratic functions from public quantum examples and private quantum statistical queries, for Pauli shadow tomography and stabilizer state learning from public multi-copy and private single-copy quantum measurements, and for solving Forrelation and Simon's problem from public quantum queries and private classical queries, where the adversary is a unidirectional or i.i.d. ancilla-free eavesdropper. The lattermost results in particular establish that the exponential separation between classical and quantum queries for Forrelation and Simon's problem survives under covertness constraints. Along the way, we design covert verifiable protocols for quantum data acquisition from public quantum queries which may be of independent interest. Overall, our models and corresponding algorithms demonstrate that quantum advantages are privately and verifiably achievable even with untrusted, remote data.
comment: 16 + 54 pages
☆ Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
☆ Resolution scaling governs DINOv3 transfer performance in chest radiograph classification
Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta's DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.
☆ Split Conformal Classification with Unsupervised Calibration
Methods for split conformal prediction leverage calibration samples to transform any prediction rule into a set-prediction rule that complies with a target coverage probability. Existing methods provide remarkably strong performance guarantees with minimal computational costs. However, they require to use calibration samples composed by labeled examples different to those used for training. This requirement can be highly inconvenient, as it prevents the use of all labeled examples for training and may require acquiring additional labels solely for calibration. This paper presents an effective methodology for split conformal prediction with unsupervised calibration for classification tasks. In the proposed approach, set-prediction rules are obtained using unsupervised calibration samples together with supervised training samples previously used to learn the classification rule. Theoretical and experimental results show that the presented methods can achieve performance comparable to that with supervised calibration, at the expenses of a moderate degradation in performance guarantees and computational efficiency.
☆ Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging
We introduce Bridged Clustering, a semi-supervised framework to learn predictors from any unpaired input $X$ and output $Y$ dataset. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.
☆ Bayesian Portfolio Optimization by Predictive Synthesis
Portfolio optimization is a critical task in investment. Most existing portfolio optimization methods require information on the distribution of returns of the assets that make up the portfolio. However, such distribution information is usually unknown to investors. Various methods have been proposed to estimate distribution information, but their accuracy greatly depends on the uncertainty of the financial markets. Due to this uncertainty, a model that could well predict the distribution information at one point in time may perform less accurately compared to another model at a different time. To solve this problem, we investigate a method for portfolio optimization based on Bayesian predictive synthesis (BPS), one of the Bayesian ensemble methods for meta-learning. We assume that investors have access to multiple asset return prediction models. By using BPS with dynamic linear models to combine these predictions, we can obtain a Bayesian predictive posterior about the mean rewards of assets that accommodate the uncertainty of the financial markets. In this study, we examine how to construct mean-variance portfolios and quantile-based portfolios based on the predicted distribution information.
Quantifying Data Contamination in Psychometric Evaluations of LLMs
Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.
comment: 12 pages, 1 figure
☆ NurseLLM: The First Specialized Language Model for Nursing
Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple choice question-answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.
comment: EMNLP 2025 Industry Track
☆ ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL
Real-world robotic agents must act under partial observability and long horizons, where key cues may appear long before they affect decision making. However, most modern approaches rely solely on instantaneous information, without incorporating insights from the past. Standard recurrent or transformer models struggle with retaining and leveraging long-term dependencies: context windows truncate history, while naive memory extensions fail under scale and sparsity. We propose ELMUR (External Layer Memory with Update/Rewrite), a transformer architecture with structured external memory. Each layer maintains memory embeddings, interacts with them via bidirectional cross-attention, and updates them through an Least Recently Used (LRU) memory module using replacement or convex blending. ELMUR extends effective horizons up to 100,000 times beyond the attention window and achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps. In POPGym, it outperforms baselines on more than half of the tasks. On MIKASA-Robo sparse-reward manipulation tasks with visual observations, it nearly doubles the performance of strong baselines. These results demonstrate that structured, layer-local external memory offers a simple and scalable approach to decision making under partial observability.
comment: 22 pages, 7 figures
☆ A Multi-Agent Framework for Stateful Inference-Time Search
Recent work explores agentic inference-time techniques to perform structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks due to the absence of persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieve surface-level code generation but remain brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families - Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.
Spectral Graph Clustering under Differential Privacy: Balancing Privacy, Accuracy, and Efficiency
We study the problem of spectral graph clustering under edge differential privacy (DP). Specifically, we develop three mechanisms: (i) graph perturbation via randomized edge flipping combined with adjacency matrix shuffling, which enforces edge privacy while preserving key spectral properties of the graph. Importantly, shuffling considerably amplifies the guarantees: whereas flipping edges with a fixed probability alone provides only a constant epsilon edge DP guarantee as the number of nodes grows, the shuffled mechanism achieves (epsilon, delta) edge DP with parameters that tend to zero as the number of nodes increase; (ii) private graph projection with additive Gaussian noise in a lower-dimensional space to reduce dimensionality and computational complexity; and (iii) a noisy power iteration method that distributes Gaussian noise across iterations to ensure edge DP while maintaining convergence. Our analysis provides rigorous privacy guarantees and a precise characterization of the misclassification error rate. Experiments on synthetic and real-world networks validate our theoretical analysis and illustrate the practical privacy-utility trade-offs.
☆ DPMM-CFL: Clustered Federated Learning via Dirichlet Process Mixture Model Nonparametric Clustering
Clustered Federated Learning (CFL) improves performance under non-IID client heterogeneity by clustering clients and training one model per cluster, thereby balancing between a global model and fully personalized models. However, most CFL methods require the number of clusters K to be fixed a priori, which is impractical when the latent structure is unknown. We propose DPMM-CFL, a CFL algorithm that places a Dirichlet Process (DP) prior over the distribution of cluster parameters. This enables nonparametric Bayesian inference to jointly infer both the number of clusters and client assignments, while optimizing per-cluster federated objectives. This results in a method where, at each round, federated updates and cluster inferences are coupled, as presented in this paper. The algorithm is validated on benchmark datasets under Dirichlet and class-split non-IID partitions.
comment: 5 pages, 2 figures
☆ TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning
Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
☆ The Contingencies of Physical Embodiment Allow for Open-Endedness and Care
Physical vulnerability and mortality are often seen as obstacles to be avoided in the development of artificial agents, which struggle to adapt to open-ended environments and provide aligned care. Meanwhile, biological organisms survive, thrive, and care for each other in an open-ended physical world with relative ease and efficiency. Understanding the role of the conditions of life in this disparity can aid in developing more robust, adaptive, and caring artificial agents. Here we define two minimal conditions for physical embodiment inspired by the existentialist phenomenology of Martin Heidegger: being-in-the-world (the agent is a part of the environment) and being-towards-death (unless counteracted, the agent drifts toward terminal states due to the second law of thermodynamics). We propose that from these conditions we can obtain both a homeostatic drive - aimed at maintaining integrity and avoiding death by expending energy to learn and act - and an intrinsic drive to continue to do so in as many ways as possible. Drawing inspiration from Friedrich Nietzsche's existentialist concept of will-to-power, we examine how intrinsic drives to maximize control over future states, e.g., empowerment, allow agents to increase the probability that they will be able to meet their future homeostatic needs, thereby enhancing their capacity to maintain physical integrity. We formalize these concepts within a reinforcement learning framework, which enables us to examine how intrinsically driven embodied agents learning in open-ended multi-agent environments may cultivate the capacities for open-endedness and care.ov
comment: 15 pages, 1 figure
☆ GNN-enhanced Traffic Anomaly Detection for Next-Generation SDN-Enabled Consumer Electronics
Consumer electronics (CE) connected to the Internet of Things are susceptible to various attacks, including DDoS and web-based threats, which can compromise their functionality and facilitate remote hijacking. These vulnerabilities allow attackers to exploit CE for broader system attacks while enabling the propagation of malicious code across the CE network, resulting in device failures. Existing deep learning-based traffic anomaly detection systems exhibit high accuracy in traditional network environments but are often overly complex and reliant on static infrastructure, necessitating manual configuration and management. To address these limitations, we propose a scalable network model that integrates Software-defined Networking (SDN) and Compute First Networking (CFN) for next-generation CE networks. In this network model, we propose a Graph Neural Networks-based Network Anomaly Detection framework (GNN-NAD) that integrates SDN-based CE networks and enables the CFN architecture. GNN-NAD uniquely fuses a static, vulnerability-aware attack graph with dynamic traffic features, providing a holistic view of network security. The core of the framework is a GNN model (GSAGE) for graph representation learning, followed by a Random Forest (RF) classifier. This design (GSAGE+RF) demonstrates superior performance compared to existing feature selection methods. Experimental evaluations on CE environment reveal that GNN-NAD achieves superior metrics in accuracy, recall, precision, and F1 score, even with small sample sizes, exceeding the performance of current network anomaly detection methods. This work advances the security and efficiency of next-generation intelligent CE networks.
comment: This paper has been accepted for publication in IEEE Transactions on Consumer Electronics. 10 pages, 6 figures
☆ Active Control of Turbulent Airfoil Flows Using Adjoint-based Deep Learning
We train active neural-network flow controllers using a deep learning PDE augmentation method to optimize lift-to-drag ratios in turbulent airfoil flows at Reynolds number $5\times10^4$ and Mach number 0.4. Direct numerical simulation and large eddy simulation are employed to model compressible, unconfined flow over two- and three-dimensional semi-infinite NACA 0012 airfoils at angles of attack $\alpha = 5^\circ$, $10^\circ$, and $15^\circ$. Control actions, implemented through a blowing/suction jet at a fixed location and geometry on the upper surface, are adaptively determined by a neural network that maps local pressure measurements to optimal jet total pressure, enabling a sensor-informed control policy that responds spatially and temporally to unsteady flow conditions. The sensitivities of the flow to the neural network parameters are computed using the adjoint Navier-Stokes equations, which we construct using automatic differentiation applied to the flow solver. The trained flow controllers significantly improve the lift-to-drag ratios and reduce flow separation for both two- and three-dimensional airfoil flows, especially at $\alpha = 5^\circ$ and $10^\circ$. The 2D-trained models remain effective when applied out-of-sample to 3D flows, which demonstrates the robustness of the adjoint-trained control approach. The 3D-trained models capture the flow dynamics even more effectively, which leads to better energy efficiency and comparable performance for both adaptive (neural network) and offline (simplified, constant-pressure) controllers. These results underscore the effectiveness of this learning-based approach in improving aerodynamic performance.
Diffusion-Augmented Reinforcement Learning for Robust Portfolio Optimization under Stress Scenarios
In the ever-changing and intricate landscape of financial markets, portfolio optimisation remains a formidable challenge for investors and asset managers. Conventional methods often struggle to capture the complex dynamics of market behaviour and align with diverse investor preferences. To address this, we propose an innovative framework, termed Diffusion-Augmented Reinforcement Learning (DARL), which synergistically integrates Denoising Diffusion Probabilistic Models (DDPMs) with Deep Reinforcement Learning (DRL) for portfolio management. By leveraging DDPMs to generate synthetic market crash scenarios conditioned on varying stress intensities, our approach significantly enhances the robustness of training data. Empirical evaluations demonstrate that DARL outperforms traditional baselines, delivering superior risk-adjusted returns and resilience against unforeseen crises, such as the 2025 Tariff Crisis. This work offers a robust and practical methodology to bolster stress resilience in DRL-driven financial applications.
☆ Non-Asymptotic Analysis of Efficiency in Conformalized Regression
Conformal prediction provides prediction sets with coverage guarantees. The informativeness of conformal prediction depends on its efficiency, typically quantified by the expected size of the prediction set. Prior work on the efficiency of conformalized regression commonly treats the miscoverage level $\alpha$ as a fixed constant. In this work, we establish non-asymptotic bounds on the deviation of the prediction set length from the oracle interval length for conformalized quantile and median regression trained via SGD, under mild assumptions on the data distribution. Our bounds of order $\mathcal{O}(1/\sqrt{n} + 1/(\alpha^2 n) + 1/\sqrt{m} + \exp(-\alpha^2 m))$ capture the joint dependence of efficiency on the proper training set size $n$, the calibration set size $m$, and the miscoverage level $\alpha$. The results identify phase transitions in convergence rates across different regimes of $\alpha$, offering guidance for allocating data to control excess prediction set length. Empirical results are consistent with our theoretical findings.
☆ Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report
World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
comment: 6 pages, 3 figures, 1X world model challenge technical report
☆ Explaining Models under Multivariate Bernoulli Distribution via Hoeffding Decomposition
Explaining the behavior of predictive models with random inputs can be achieved through sub-models decomposition, where such sub-models have easier interpretable features. Arising from the uncertainty quantification community, recent results have demonstrated the existence and uniqueness of a generalized Hoeffding decomposition for such predictive models when the stochastic input variables are correlated, based on concepts of oblique projection onto L 2 subspaces. This article focuses on the case where the input variables have Bernoulli distributions and provides a complete description of this decomposition. We show that in this case the underlying L 2 subspaces are one-dimensional and that the functional decomposition is explicit. This leads to a complete interpretability framework and theoretically allows reverse engineering. Explicit indicators of the influence of inputs on the output prediction (exemplified by Sobol' indices and Shapley effects) can be explicitly derived. Illustrated by numerical experiments, this type of analysis proves useful for addressing decision-support problems, based on binary decision diagrams, Boolean networks or binary neural networks. The article outlines perspectives for exploring high-dimensional settings and, beyond the case of binary inputs, extending these findings to models with finite countable inputs.
☆ Non-Stationary Online Structured Prediction with Surrogate Losses
Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. Therein the surrogate regret -- the cumulative excess of the target loss (e.g., 0-1 loss) over the surrogate loss (e.g., logistic loss) of the fixed best estimator -- has gained attention, particularly because it often admits a finite bound independent of the time horizon $T$. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur the surrogate loss growing linearly with $T$. We address this by proving a bound of the form $F_T + C(1 + P_T)$ on the cumulative target loss, where $F_T$ is the cumulative surrogate loss of any comparator sequence, $P_T$ is its path length, and $C > 0$ is some constant. This bound depends on $T$ only through $F_T$ and $P_T$, often yielding much stronger guarantees in non-stationary environments. Our core idea is to synthesize the dynamic regret bound of the online gradient descent (OGD) with the technique of exploiting the surrogate gap. Our analysis also sheds light on a new Polyak-style learning rate for OGD, which systematically offers target-loss guarantees and exhibits promising empirical performance. We further extend our approach to a broader class of problems via the convolutional Fenchel--Young loss. Finally, we prove a lower bound showing that the dependence on $F_T$ and $P_T$ is tight.
☆ HTMformer: Hybrid Time and Multivariate Transformer for Time Series Forecasting
Transformer-based methods have achieved impressive results in time series forecasting. However, existing Transformers still exhibit limitations in sequence modeling as they tend to overemphasize temporal dependencies. This incurs additional computational overhead without yielding corresponding performance gains. We find that the performance of Transformers is highly dependent on the embedding method used to learn effective representations. To address this issue, we extract multivariate features to augment the effective information captured in the embedding layer, yielding multidimensional embeddings that convey richer and more meaningful sequence representations. These representations enable Transformer-based forecasters to better understand the series. Specifically, we introduce Hybrid Temporal and Multivariate Embeddings (HTME). The HTME extractor integrates a lightweight temporal feature extraction module with a carefully designed multivariate feature extraction module to provide complementary features, thereby achieving a balance between model complexity and performance. By combining HTME with the Transformer architecture, we present HTMformer, leveraging the enhanced feature extraction capability of the HTME extractor to build a lightweight forecaster. Experiments conducted on eight real-world datasets demonstrate that our approach outperforms existing baselines in both accuracy and efficiency.
☆ Pseudo-MDPs: A Novel Framework for Efficiently Optimizing Last Revealer Seed Manipulations in Blockchains
This study tackles the computational challenges of solving Markov Decision Processes (MDPs) for a restricted class of problems. It is motivated by the Last Revealer Attack (LRA), which undermines fairness in some Proof-of-Stake (PoS) blockchains such as Ethereum (\$400B market capitalization). We introduce pseudo-MDPs (pMDPs) a framework that naturally models such problems and propose two distinct problem reductions to standard MDPs. One problem reduction provides a novel, counter-intuitive perspective, and combining the two problem reductions enables significant improvements in dynamic programming algorithms such as value iteration. In the case of the LRA which size is parameterized by $\kappa$ (in Ethereum's case $\kappa$ = 32), we reduce the computational complexity from O(2^$\kappa$ $\kappa$^2^($\kappa$+2)) to O($\kappa$^4) (per iteration). This solution also provide the usual benefits from Dynamic Programming solutions: exponentially fast convergence toward the optimal solution is guaranteed. The dual perspective also simplifies policy extraction, making the approach well-suited for resource-constrained agents who can operate with very limited memory and computation once the problem has been solved. Furthermore, we generalize those results to a broader class of MDPs, enhancing their applicability. The framework is validated through two case studies: a fictional card game and the LRA on the Ethereum random seed consensus protocol. These applications demonstrate the framework's ability to solve large-scale problems effectively while offering actionable insights into optimal strategies. This work advances the study of MDPs and contributes to understanding security vulnerabilities in blockchain systems.
☆ Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention. By unifying vision, language, and action data at scale, which have traditionally been studied separately, VLA models aim to learn policies that generalise across diverse tasks, objects, embodiments, and environments. This generalisation capability is expected to enable robots to solve novel downstream tasks with minimal or no additional task-specific data, facilitating more flexible and scalable real-world deployment. Unlike previous surveys that focus narrowly on action representations or high-level model architectures, this work offers a comprehensive, full-stack review, integrating both software and hardware components of VLA systems. In particular, this paper provides a systematic review of VLAs, covering their strategy and architectural transition, architectures and building blocks, modality-specific processing techniques, and learning paradigms. In addition, to support the deployment of VLAs in real-world robotic applications, we also review commonly used robot platforms, data collection strategies, publicly available datasets, data augmentation methods, and evaluation benchmarks. Throughout this comprehensive survey, this paper aims to offer practical guidance for the robotics community in applying VLAs to real-world robotic systems. All references categorized by training approach, evaluation method, modality, and dataset are available in the table on our project website: https://vla-survey.github.io .
comment: Accepted to IEEE Access, website: https://vla-survey.github.io
☆ Blind Construction of Angular Power Maps in Massive MIMO Networks
Channel state information (CSI) acquisition is a challenging problem in massive multiple-input multiple-output (MIMO) networks. Radio maps provide a promising solution for radio resource management by reducing online CSI acquisition. However, conventional approaches for radio map construction require location-labeled CSI data, which is challenging in practice. This paper investigates unsupervised angular power map construction based on large timescale CSI data collected in a massive MIMO network without location labels. A hidden Markov model (HMM) is built to connect the hidden trajectory of a mobile with the CSI evolution of a massive MIMO channel. As a result, the mobile location can be estimated, enabling the construction of an angular power map. We show that under uniform rectilinear mobility with Poisson-distributed base stations (BSs), the Cramer-Rao Lower Bound (CRLB) for localization error can vanish at any signal-to-noise ratios (SNRs), whereas when BSs are confined to a limited region, the error remains nonzero even with infinite independent measurements. Based on reference signal received power (RSRP) data collected in a real multi-cell massive MIMO network, an average localization error of 18 meters can be achieved although measurements are mainly obtained from a single serving cell.
☆ Introspection in Learned Semantic Scene Graph Localisation
This work investigates how semantics influence localisation performance and robustness in a learned self-supervised, contrastive semantic localisation framework. After training a localisation network on both original and perturbed maps, we conduct a thorough post-hoc introspection analysis to probe whether the model filters environmental noise and prioritises distinctive landmarks over routine clutter. We validate various interpretability methods and present a comparative reliability analysis. Integrated gradients and Attention Weights consistently emerge as the most reliable probes of learned behaviour. A semantic class ablation further reveals an implicit weighting in which frequent objects are often down-weighted. Overall, the results indicate that the model learns noise-robust, semantically salient relations about place definition, thereby enabling explainable registration under challenging visual and structural variations.
comment: IEEE IROS 2025 Workshop FAST
☆ Enhancing Speech Emotion Recognition via Fine-Tuning Pre-Trained Models and Hyper-Parameter Optimisation
We propose a workflow for speech emotion recognition (SER) that combines pre-trained representations with automated hyperparameter optimisation (HPO). Using SpeechBrain wav2vec2-base model fine-tuned on IEMOCAP as the encoder, we compare two HPO strategies, Gaussian Process Bayesian Optimisation (GP-BO) and Tree-structured Parzen Estimators (TPE), under an identical four-dimensional search space and 15-trial budget, with balanced class accuracy (BCA) on the German EmoDB corpus as the objective. All experiments run on 8 CPU cores with 32 GB RAM. GP-BO achieves 0.96 BCA in 11 minutes, and TPE (Hyperopt implementation) attains 0.97 in 15 minutes. In contrast, grid search requires 143 trials and 1,680 minutes to exceed 0.9 BCA, and the best AutoSpeech 2020 baseline reports only 0.85 in 30 minutes on GPU. For cross-lingual generalisation, an EmoDB-trained HPO-tuned model improves zero-shot accuracy by 0.25 on CREMA-D and 0.26 on RAVDESS. Results show that efficient HPO with pre-trained encoders delivers competitive SER on commodity CPUs. Source code to this work is available at: https://github.com/youngaryan/speechbrain-emotion-hpo.
☆ COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization
Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
☆ Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration CIKM 2025
Molecular representation learning plays a crucial role in advancing applications such as drug discovery and material design. Existing work leverages 2D and 3D modalities of molecular information for pre-training, aiming to capture comprehensive structural and geometric insights. However, these methods require paired 2D and 3D molecular data to train the model effectively and prevent it from collapsing into a single modality, posing limitations in scenarios where a certain modality is unavailable or computationally expensive to generate. To overcome this limitation, we propose FlexMol, a flexible molecule pre-training framework that learns unified molecular representations while supporting single-modality input. Specifically, inspired by the unified structure in vision-language models, our approach employs separate models for 2D and 3D molecular data, leverages parameter sharing to improve computational efficiency, and utilizes a decoder to generate features for the missing modality. This enables a multistage continuous learning process where both modalities contribute collaboratively during training, while ensuring robustness when only one modality is available during inference. Extensive experiments demonstrate that FlexMol achieves superior performance across a wide range of molecular property prediction tasks, and we also empirically demonstrate its effectiveness with incomplete data. Our code and data are available at https://github.com/tewiSong/FlexMol.
comment: CIKM 2025
☆ Federated Unlearning in the Wild: Rethinking Fairness and Data Discrepancy
Machine unlearning is critical for enforcing data deletion rights like the "right to be forgotten." As a decentralized paradigm, Federated Learning (FL) also requires unlearning, but realistic implementations face two major challenges. First, fairness in Federated Unlearning (FU) is often overlooked. Exact unlearning methods typically force all clients into costly retraining, even those uninvolved. Approximate approaches, using gradient ascent or distillation, make coarse interventions that can unfairly degrade performance for clients with only retained data. Second, most FU evaluations rely on synthetic data assumptions (IID/non-IID) that ignore real-world heterogeneity. These unrealistic benchmarks obscure the true impact of unlearning and limit the applicability of current methods. We first conduct a comprehensive benchmark of existing FU methods under realistic data heterogeneity and fairness conditions. We then propose a novel, fairness-aware FU approach, Federated Cross-Client-Constrains Unlearning (FedCCCU), to explicitly address both challenges. FedCCCU offers a practical and scalable solution for real-world FU. Experimental results show that existing methods perform poorly in realistic settings, while our approach consistently outperforms them.
☆ Native Hybrid Attention for Efficient Sequence Modeling
Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra \& inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single \texttt{softmax attention} operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.
comment: Technical report, 16 pages
☆ Sharpness-Aware Data Generation for Zero-shot Quantization
Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing the full-precision model. While it is well-known that deep neural networks with low sharpness have better generalization ability, none of the previous zero-shot quantization works considers the sharpness of the quantized model as a criterion for generating training data. This paper introduces a novel methodology that takes into account quantized model sharpness in synthetic data generation to enhance generalization. Specifically, we first demonstrate that sharpness minimization can be attained by maximizing gradient matching between the reconstruction loss gradients computed on synthetic and real validation data, under certain assumptions. We then circumvent the problem of the gradient matching without real validation set by approximating it with the gradient matching between each generated sample and its neighbors. Experimental evaluations on CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed method over the state-of-the-art techniques in low-bit quantization settings.
☆ Root Cause Analysis of Outliers in Unknown Cyclic Graphs
We study the propagation of outliers in cyclic causal graphs with linear structural equations, tracing them back to one or several "root cause" nodes. We show that it is possible to identify a short list of potential root causes provided that the perturbation is sufficiently strong and propagates according to the same structural equations as in the normal mode. This shortlist consists of the true root causes together with those of its parents lying on a cycle with the root cause. Notably, our method does not require prior knowledge of the causal graph.
☆ Spiral Model Technique For Data Science & Machine Learning Lifecycle
Analytics play an important role in modern business. Companies adapt data science lifecycles to their culture to seek productivity and improve their competitiveness among others. Data science lifecycles are fairly an important contributing factor to start and end a project that are data dependent. Data science and Machine learning life cycles comprises of series of steps that are involved in a project. A typical life cycle states that it is a linear or cyclical model that revolves around. It is mostly depicted that it is possible in a traditional data science life cycle to start the process again after reaching the end of cycle. This paper suggests a new technique to incorporate data science life cycle to business problems that have a clear end goal. A new technique called spiral technique is introduced to emphasize versatility, agility and iterative approach to business processes.
☆ Revisiting Mixout: An Overlooked Path to Robust Finetuning
Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the \emph{masking anchor}, \emph{resampling frequency}, and \emph{mask sparsity}. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. Experiments on benchmarks covering covariate shift, corruption, and class imbalance, ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C, GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
☆ Relational Database Distillation: From Structured Tables to Condensed Graph Data
Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances leverage graph representation learning to capture complex inter-table relations as multi-hop dependencies. Despite achieving state-of-the-art performance, these methods remain hindered by the prohibitive storage overhead and excessive training time, due to the massive scale of the database and the computational burden of intensive message passing across interconnected tables. To alleviate these concerns, we propose and study the problem of Relational Database Distillation (RDD). Specifically, we aim to distill large-scale RDBs into compact heterogeneous graphs while retaining the predictive power (i.e., utility) required for training graph-based models. Multi-modal column information is preserved through node features, and primary-foreign key relations are encoded via heterogeneous edges, thereby maintaining both data fidelity and relational structure. To ensure adaptability across diverse downstream tasks without engaging the traditional, inefficient bi-level distillation framework, we further design a kernel ridge regression-guided objective with pseudo-labels, which produces quality features for the distilled graph. Extensive experiments on multiple real-world RDBs demonstrate that our solution substantially reduces the data size while maintaining competitive performance on classification and regression tasks, creating an effective pathway for scalable learning with RDBs.
☆ Falsification-Driven Reinforcement Learning for Maritime Motion Planning
Compliance with maritime traffic rules is essential for the safe operation of autonomous vessels, yet training reinforcement learning (RL) agents to adhere to them is challenging. The behavior of RL agents is shaped by the training scenarios they encounter, but creating scenarios that capture the complexity of maritime navigation is non-trivial, and real-world data alone is insufficient. To address this, we propose a falsification-driven RL approach that generates adversarial training scenarios in which the vessel under test violates maritime traffic rules, which are expressed as signal temporal logic specifications. Our experiments on open-sea navigation with two vessels demonstrate that the proposed approach provides more relevant training scenarios and achieves more consistent rule compliance.
☆ Accelerating Sparse Ternary GEMM for Quantized LLM inference on Apple Silicon
Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices with 50% ternary nonzero values (sparsity), reaching up to a 50.2% of the processor's theoretical peak performance, and remains stable across varying sparsity levels. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity, and remains stable across varying sparsity levels.
☆ High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization WACV 2026
Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
comment: WACV 2026: Winter Conference on Applications of Computer Vision 2026
☆ From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in [Zhou et al. NeurIPS 2022] to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages. In the first stage, asymmetric weight perturbations from random initialization sustain non-degenerate gradient dynamics in parameter matrices, facilitating systematic escape from small initialization regimes. Subsequently, these matrices undergo condensation, progressively aligning toward the target orientation. In the second stage, the previously static key-query matrices actively participate in training, driving the normalized matrices toward asymptotic rank collapse. This two-stage framework generalizes classical directional convergence results.
☆ Grouped Differential Attention
The self-attention mechanism, while foundational to modern Transformer architectures, suffers from a critical inefficiency: it frequently allocates substantial attention to redundant or noisy context. Differential Attention addressed this by using subtractive attention maps for signal and noise, but its required balanced head allocation imposes rigid constraints on representational flexibility and scalability. To overcome this, we propose Grouped Differential Attention (GDA), a novel approach that introduces unbalanced head allocation between signal-preserving and noise-control groups. GDA significantly enhances signal focus by strategically assigning more heads to signal extraction and fewer to noise-control, stabilizing the latter through controlled repetition (akin to GQA). This design achieves stronger signal fidelity with minimal computational overhead. We further extend this principle to group-differentiated growth, a scalable strategy that selectively replicates only the signal-focused heads, thereby ensuring efficient capacity expansion. Through large-scale pretraining and continual training experiments, we demonstrate that moderate imbalance ratios in GDA yield substantial improvements in generalization and stability compared to symmetric baselines. Our results collectively establish that ratio-aware head allocation and selective expansion offer an effective and practical path toward designing scalable, computation-efficient Transformer architectures.
☆ Fisher Information, Training and Bias in Fourier Regression Models
Motivated by the growing interest in quantum machine learning, in particular quantum neural networks (QNNs), we study how recently introduced evaluation metrics based on the Fisher information matrix (FIM) are effective for predicting their training and prediction performance. We exploit the equivalence between a broad class of QNNs and Fourier models, and study the interplay between the \emph{effective dimension} and the \emph{bias} of a model towards a given task, investigating how these affect the model's training and performance. We show that for a model that is completely agnostic, or unbiased, towards the function to be learned, a higher effective dimension likely results in a better trainability and performance. On the other hand, for models that are biased towards the function to be learned a lower effective dimension is likely beneficial during training. To obtain these results, we derive an analytical expression of the FIM for Fourier models and identify the features controlling a model's effective dimension. This allows us to construct models with tunable effective dimension and bias, and to compare their training. We furthermore introduce a tensor network representation of the considered Fourier models, which could be a tool of independent interest for the analysis of QNN models. Overall, these findings provide an explicit example of the interplay between geometrical properties, model-task alignment and training, which are relevant for the broader machine learning community.
☆ Revisiting Node Affinity Prediction in Temporal Graphs
Node affinity prediction is a common task that is widely used in temporal graph learning with applications in social and financial networks, recommender systems, and more. Recent works have addressed this task by adapting state-of-the-art dynamic link property prediction models to node affinity prediction. However, simple heuristics, such as Persistent Forecast or Moving Average, outperform these models. In this work, we analyze the challenges in training current Temporal Graph Neural Networks for node affinity prediction and suggest appropriate solutions. Combining the solutions, we develop NAViS - Node Affinity prediction model using Virtual State, by exploiting the equivalence between heuristics and state space models. While promising, training NAViS is non-trivial. Therefore, we further introduce a novel loss function for node affinity prediction. We evaluate NAViS on TGB and show that it outperforms the state-of-the-art, including heuristics. Our source code is available at https://github.com/orfeld415/NAVIS
comment: preprint
☆ PyCFRL: A Python library for counterfactually fair offline reinforcement learning via sequential data preprocessing
Reinforcement learning (RL) aims to learn and evaluate a sequential decision rule, often referred to as a "policy", that maximizes the population-level benefit in an environment across possibly infinitely many time steps. However, the sequential decisions made by an RL algorithm, while optimized to maximize overall population benefits, may disadvantage certain individuals who are in minority or socioeconomically disadvantaged groups. To address this problem, we introduce PyCFRL, a Python library for ensuring counterfactual fairness in offline RL. PyCFRL implements a novel data preprocessing algorithm for learning counterfactually fair RL policies from offline datasets and provides tools to evaluate the values and counterfactual unfairness levels of RL policies. We describe the high-level functionalities of PyCFRL and demonstrate one of its major use cases through a data example. The library is publicly available on PyPI and Github (https://github.com/JianhanZhang/PyCFRL), and detailed tutorials can be found in the PyCFRL documentation (https://pycfrl-documentation.netlify.app).
☆ Textual interpretation of transient image classifications from large language models
Modern astronomical surveys deliver immense volumes of transient detections, yet distinguishing real astrophysical signals (for example, explosive events) from bogus imaging artefacts remains a challenge. Convolutional neural networks are effectively used for real versus bogus classification; however, their reliance on opaque latent representations hinders interpretability. Here we show that large language models (LLMs) can approach the performance level of a convolutional neural network on three optical transient survey datasets (Pan-STARRS, MeerLICHT and ATLAS) while simultaneously producing direct, human-readable descriptions for every candidate. Using only 15 examples and concise instructions, Google's LLM, Gemini, achieves a 93% average accuracy across datasets that span a range of resolution and pixel scales. We also show that a second LLM can assess the coherence of the output of the first model, enabling iterative refinement by identifying problematic cases. This framework allows users to define the desired classification behaviour through natural language and examples, bypassing traditional training pipelines. Furthermore, by generating textual descriptions of observed features, LLMs enable users to query classifications as if navigating an annotated catalogue, rather than deciphering abstract latent spaces. As next-generation telescopes and surveys further increase the amount of data available, LLM-based classification could help bridge the gap between automated detection and transparent, human-level understanding.
comment: Published in Nature Astronomy (2025). Publisher's Version of Record (CC BY 4.0). DOI: 10.1038/s41550-025-02670-z
☆ Quantum Sparse Recovery and Quantum Orthogonal Matching Pursuit
We study quantum sparse recovery in non-orthogonal, overcomplete dictionaries: given coherent quantum access to a state and a dictionary of vectors, the goal is to reconstruct the state up to $\ell_2$ error using as few vectors as possible. We first show that the general recovery problem is NP-hard, ruling out efficient exact algorithms in full generality. To overcome this, we introduce Quantum Orthogonal Matching Pursuit (QOMP), the first quantum analogue of the classical OMP greedy algorithm. QOMP combines quantum subroutines for inner product estimation, maximum finding, and block-encoded projections with an error-resetting design that avoids iteration-to-iteration error accumulation. Under standard mutual incoherence and well-conditioned sparsity assumptions, QOMP provably recovers the exact support of a $K$-sparse state in polynomial time. As an application, we give the first framework for sparse quantum tomography with non-orthogonal dictionaries in $\ell_2$ norm, achieving query complexity $\widetilde{O}(\sqrt{N}/\epsilon)$ in favorable regimes and reducing tomography to estimating only $K$ coefficients instead of $N$ amplitudes. In particular, for pure-state tomography with $m=O(N)$ dictionary vectors and sparsity $K=\widetilde{O}(1)$ on a well-conditioned subdictionary, this circumvents the $\widetilde{\Omega}(N/\epsilon)$ lower bound that holds in the dense, orthonormal-dictionary setting, without contradiction, by leveraging sparsity together with non-orthogonality. Beyond tomography, we analyze QOMP in the QRAM model, where it yields polynomial speedups over classical OMP implementations, and provide a quantum algorithm to estimate the mutual incoherence of a dictionary of $m$ vectors in $O(m/\epsilon)$ queries, improving over both deterministic and quantum-inspired classical methods.
☆ Bayesian Nonparametric Dynamical Clustering of Time Series
We present a method that models the evolution of an unbounded number of time series clusters by switching among an unknown number of regimes with linear dynamics. We develop a Bayesian non-parametric approach using a hierarchical Dirichlet process as a prior on the parameters of a Switching Linear Dynamical System and a Gaussian process prior to model the statistical variations in amplitude and temporal alignment within each cluster. By modeling the evolution of time series patterns, the method avoids unnecessary proliferation of clusters in a principled manner. We perform inference by formulating a variational lower bound for off-line and on-line scenarios, enabling efficient learning through optimization. We illustrate the versatility and effectiveness of the approach through several case studies of electrocardiogram analysis using publicly available databases.
comment: This work has been submitted to the IEEE for possible publication. 15 pages. 9 figures
☆ DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning
Realistic traffic simulation is critical for the development of autonomous driving systems and urban mobility planning, yet existing imitation learning approaches often fail to model realistic traffic behaviors. Behavior cloning suffers from covariate shift, while Generative Adversarial Imitation Learning (GAIL) is notoriously unstable in multi-agent settings. We identify a key source of this instability: irrelevant interaction misguidance, where a discriminator penalizes an ego vehicle's realistic behavior due to unrealistic interactions among its neighbors. To address this, we propose Decomposed Multi-agent GAIL (DecompGAIL), which explicitly decomposes realism into ego-map and ego-neighbor components, filtering out misleading neighbor: neighbor and neighbor: map interactions. We further introduce a social PPO objective that augments ego rewards with distance-weighted neighborhood rewards, encouraging overall realism across agents. Integrated into a lightweight SMART-based backbone, DecompGAIL achieves state-of-the-art performance on the WOMD Sim Agents 2025 benchmark.
☆ Utilizing Large Language Models for Machine Learning Explainability
This study explores the explainability capabilities of large language models (LLMs), when employed to autonomously generate machine learning (ML) solutions. We examine two classification tasks: (i) a binary classification problem focused on predicting driver alertness states, and (ii) a multilabel classification problem based on the yeast dataset. Three state-of-the-art LLMs (i.e. OpenAI GPT, Anthropic Claude, and DeepSeek) are prompted to design training pipelines for four common classifiers: Random Forest, XGBoost, Multilayer Perceptron, and Long Short-Term Memory networks. The generated models are evaluated in terms of predictive performance (recall, precision, and F1-score) and explainability using SHAP (SHapley Additive exPlanations). Specifically, we measure Average SHAP Fidelity (Mean Squared Error between SHAP approximations and model outputs) and Average SHAP Sparsity (number of features deemed influential). The results reveal that LLMs are capable of producing effective and interpretable models, achieving high fidelity and consistent sparsity, highlighting their potential as automated tools for interpretable ML pipeline generation. The results show that LLMs can produce effective, interpretable pipelines with high fidelity and consistent sparsity, closely matching manually engineered baselines.
☆ Vacuum Spiker: A Spiking Neural Network-Based Model for Efficient Anomaly Detection in Time Series
Anomaly detection is a key task across domains such as industry, healthcare, and cybersecurity. Many real-world anomaly detection problems involve analyzing multiple features over time, making time series analysis a natural approach for such problems. While deep learning models have achieved strong performance in this field, their trend to exhibit high energy consumption limits their deployment in resource-constrained environments such as IoT devices, edge computing platforms, and wearables. To address this challenge, this paper introduces the \textit{Vacuum Spiker algorithm}, a novel Spiking Neural Network-based method for anomaly detection in time series. It incorporates a new detection criterion that relies on global changes in neural activity rather than reconstruction or prediction error. It is trained using Spike Time-Dependent Plasticity in a novel way, intended to induce changes in neural activity when anomalies occur. A new efficient encoding scheme is also proposed, which discretizes the input space into non-overlapping intervals, assigning each to a single neuron. This strategy encodes information with a single spike per time step, improving energy efficiency compared to conventional encoding methods. Experimental results on publicly available datasets show that the proposed algorithm achieves competitive performance while significantly reducing energy consumption, compared to a wide set of deep learning and machine learning baselines. Furthermore, its practical utility is validated in a real-world case study, where the model successfully identifies power curtailment events in a solar inverter. These results highlight its potential for sustainable and efficient anomaly detection.
comment: 53 pages, 16 figures, preprint submitted to a journal for review
☆ Angular Constraint Embedding via SpherePair Loss for Constrained Clustering NeurIPS 2025
Constrained clustering integrates domain knowledge through pairwise constraints. However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at \href{https://github.com/spherepaircc/SpherePairCC/tree/main}{our repository}.
comment: Accepted by NeurIPS 2025, 6 Figures and 1 Table in Main text, 18 Figures and 5 Tables in Appendices
☆ Multi-Dimensional Autoscaling of Stream Processing Services on Edge Devices
Edge devices have limited resources, which inevitably leads to situations where stream processing services cannot satisfy their needs. While existing autoscaling mechanisms focus entirely on resource scaling, Edge devices require alternative ways to sustain the Service Level Objectives (SLOs) of competing services. To address these issues, we introduce a Multi-dimensional Autoscaling Platform (MUDAP) that supports fine-grained vertical scaling across both service- and resource-level dimensions. MUDAP supports service-specific scaling tailored to available parameters, e.g., scale data quality or model size for a particular service. To optimize the execution across services, we present a scaling agent based on Regression Analysis of Structural Knowledge (RASK). The RASK agent efficiently explores the solution space and learns a continuous regression model of the processing environment for inferring optimal scaling actions. We compared our approach with two autoscalers, the Kubernetes VPA and a reinforcement learning agent, for scaling up to 9 services on a single Edge device. Our results showed that RASK can infer an accurate regression model in merely 20 iterations (i.e., observe 200s of processing). By increasingly adding elasticity dimensions, RASK sustained the highest request load with 28% less SLO violations, compared to baselines.
☆ MoRE-GNN: Multi-omics Data Integration with a Heterogeneous Graph Autoencoder
The integration of multi-omics single-cell data remains challenging due to high-dimensionality and complex inter-modality relationships. To address this, we introduce MoRE-GNN (Multi-omics Relational Edge Graph Neural Network), a heterogeneous graph autoencoder that combines graph convolution and attention mechanisms to dynamically construct relational graphs directly from data. Evaluations on six publicly available datasets demonstrate that MoRE-GNN captures biologically meaningful relationships and outperforms existing methods, particularly in settings with strong inter-modality correlations. Furthermore, the learned representations allow for accurate downstream cross-modal predictions. While performance may vary with dataset complexity, MoRE-GNN offers an adaptive, scalable and interpretable framework for advancing multi-omics integration.
☆ SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the \textit{Reasoning Tax}. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance $70.13$ and $78.97$ on safety and helpfulness across six benchmarks, surpassing both same-scale and $>10\times$ larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by \num{6.47} and \num{16.76} points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.
☆ Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Retrieval
We consider image transmission via deep joint source-channel coding (DeepJSCC) over multi-hop additive white Gaussian noise (AWGN) channels by training a DeepJSCC encoder-decoder pair with a pre-trained deep hash distillation (DHD) module to semantically cluster images, facilitating security-oriented applications through enhanced semantic consistency and improving the perceptual reconstruction quality. We train the DeepJSCC module to both reduce mean square error (MSE) and minimize cosine distance between DHD hashes of source and reconstructed images. Significantly improved perceptual quality as a result of semantic alignment is illustrated for different multi-hop settings, for which classical DeepJSCC may suffer from noise accumulation, measured by the learned perceptual image patch similarity (LPIPS) metric.
☆ Towards Generalization of Graph Neural Networks for AC Optimal Power Flow
AC Optimal Power Flow (ACOPF) is computationally expensive for large-scale power systems, with conventional solvers requiring prohibitive solution times. Machine learning approaches offer computational speedups but struggle with scalability and topology adaptability without expensive retraining. To enable scalability across grid sizes and adaptability to topology changes, we propose a Hybrid Heterogeneous Message Passing Neural Network (HH-MPNN). HH-MPNN models buses, generators, loads, shunts, transmission lines and transformers as distinct node or edge types, combined with a scalable transformer model for handling long-range dependencies. On grids from 14 to 2,000 buses, HH-MPNN achieves less than 1% optimality gap on default topologies. Applied zero-shot to thousands of unseen topologies, HH-MPNN achieves less than 3% optimality gap despite training only on default topologies. Pre-training on smaller grids also improves results on a larger grid. Computational speedups reach 1,000x to 10,000x compared to interior point solvers. These results advance practical, generalizable machine learning for real-time power system operations.
comment: Pre-print has been submitted for review
☆ Enhancing Bankruptcy Prediction of Banks through Advanced Machine Learning Techniques: An Innovative Approach and Analysis
Context: Financial system stability is determined by the condition of the banking system. A bank failure can destroy the stability of the financial system, as banks are subject to systemic risk, affecting not only individual banks but also segments or the entire financial system. Calculating the probability of a bank going bankrupt is one way to ensure the banking system is safe and sound. Existing literature and limitations: Statistical models, such as Altman's Z-Score, are one of the common techniques for developing a bankruptcy prediction model. However, statistical methods rely on rigid and sometimes irrelevant assumptions, which can result in low forecast accuracy. New approaches are necessary. Objective of the research: Bankruptcy models are developed using machine learning techniques, such as logistic regression (LR), random forest (RF), and support vector machines (SVM). According to several studies, machine learning is also more accurate and effective than statistical methods for categorising and forecasting banking risk management. Present Research: The commercial bank data are derived from the annual financial statements of 44 active banks and 21 bankrupt banks in Turkey from 1994 to 2004, and the rural bank data are derived from the quarterly financial reports of 43 active and 43 bankrupt rural banks in Indonesia between 2013 and 2019. Five rural banks in Indonesia have also been selected to demonstrate the feasibility of analysing bank bankruptcy trends. Findings and implications: The results of the research experiments show that RF can forecast data from commercial banks with a 90% accuracy rate. Furthermore, the three machine learning methods proposed accurately predict the likelihood of rural bank bankruptcy. Contribution and Conclusion: The proposed innovative machine learning approach help to implement policies that reduce the costs of bankruptcy.
☆ Reconquering Bell sampling on qudits: stabilizer learning and testing, quantum pseudorandomness bounds, and more
Bell sampling is a simple yet powerful tool based on measuring two copies of a quantum state in the Bell basis, and has found applications in a plethora of problems related to stabiliser states and measures of magic. However, it was not known how to generalise the procedure from qubits to $d$-level systems -- qudits -- for all dimensions $d > 2$ in a useful way. Indeed, a prior work of the authors (arXiv'24) showed that the natural extension of Bell sampling to arbitrary dimensions fails to provide meaningful information about the quantum states being measured. In this paper, we overcome the difficulties encountered in previous works and develop a useful generalisation of Bell sampling to qudits of all $d\geq 2$. At the heart of our primitive is a new unitary, based on Lagrange's four-square theorem, that maps four copies of any stabiliser state $|\mathcal{S}\rangle$ to four copies of its complex conjugate $|\mathcal{S}^\ast\rangle$ (up to some Pauli operator), which may be of independent interest. We then demonstrate the utility of our new Bell sampling technique by lifting several known results from qubits to qudits for any $d\geq 2$: 1. Learning stabiliser states in $O(n^3)$ time with $O(n)$ samples; 2. Solving the Hidden Stabiliser Group Problem in $\tilde{O}(n^3/\varepsilon)$ time with $\tilde{O}(n/\varepsilon)$ samples; 3. Testing whether $|\psi\rangle$ has stabiliser size at least $d^t$ or is $\varepsilon$-far from all such states in $\tilde{O}(n^3/\varepsilon)$ time with $\tilde{O}(n/\varepsilon)$ samples; 4. Clifford circuits with at most $n/2$ single-qudit non-Clifford gates cannot prepare pseudorandom states; 5. Testing whether $|\psi\rangle$ has stabiliser fidelity at least $1-\varepsilon_1$ or at most $1-\varepsilon_2$ with $O(d^2/\varepsilon_2)$ samples if $\varepsilon_1 = 0$ or $O(d^2/\varepsilon_2^2)$ samples if $\varepsilon_1 = O(d^{-2})$.
comment: 51 pages, 1 figure. Comments are welcome
☆ CNN-TFT explained by SHAP with multi-head attention weights for time series forecasting
Convolutional neural networks (CNNs) and transformer architectures offer strengths for modeling temporal data: CNNs excel at capturing local patterns and translational invariances, while transformers effectively model long-range dependencies via self-attention. This paper proposes a hybrid architecture integrating convolutional feature extraction with a temporal fusion transformer (TFT) backbone to enhance multivariate time series forecasting. The CNN module first applies a hierarchy of one-dimensional convolutional layers to distill salient local patterns from raw input sequences, reducing noise and dimensionality. The resulting feature maps are then fed into the TFT, which applies multi-head attention to capture both short- and long-term dependencies and to weigh relevant covariates adaptively. We evaluate the CNN-TFT on a hydroelectric natural flow time series dataset. Experimental results demonstrate that CNN-TFT outperforms well-established deep learning models, with a mean absolute percentage error of up to 2.2%. The explainability of the model is obtained by a proposed Shapley additive explanations with multi-head attention weights (SHAP-MHAW). Our novel architecture, named CNN-TFT-SHAP-MHAW, is promising for applications requiring high-fidelity, multivariate time series forecasts, being available for future analysis at https://github.com/SFStefenon/CNN-TFT-SHAP-MHAW .
☆ Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
Attention is a core operation in numerous machine learning and artificial intelligence models. This work focuses on the acceleration of attention kernel using FlashAttention algorithm, in vector processors, particularly those based on the RISC-V instruction set architecture (ISA). This work represents the first effort to vectorize FlashAttention, minimizing scalar code and simplifying the computational complexity of evaluating exponentials needed by softmax used in attention. By utilizing a low-cost approximation for exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function without the need to extend baseline vector ISA with new custom instructions. Also, appropriate tiling strategies are explored with the goal to improve memory locality. Experimental results highlight the scalability of our approach, demonstrating significant performance gains with the vectorized implementations when processing attention layers in practical applications.
☆ Early wind turbine alarm prediction based on machine learning: AlarmForecasting
Alarm data is pivotal in curbing fault behavior in Wind Turbines (WTs) and forms the backbone for advancedpredictive monitoring systems. Traditionally, research cohorts have been confined to utilizing alarm data solelyas a diagnostic tool, merely indicative of unhealthy status. However, this study aims to offer a transformativeleap towards preempting alarms, preventing alarms from triggering altogether, and consequently avertingimpending failures. Our proposed Alarm Forecasting and Classification (AFC) framework is designed on twosuccessive modules: first, the regression module based on long short-term memory (LSTM) for time-series alarmforecasting, and thereafter, the classification module to implement alarm tagging on the forecasted alarm. Thisway, the entire alarm taxonomy can be forecasted reliably rather than a few specific alarms. 14 Senvion MM82turbines with an operational period of 5 years are used as a case study; the results demonstrated 82%, 52%,and 41% accurate forecasts for 10, 20, and 30 min alarm forecasts, respectively. The results substantiateanticipating and averting alarms, which is significant in curbing alarm frequency and enhancing operationalefficiency through proactive intervention.
comment: International Journal of Electrical Power and Energy Systems
☆ Recurrence-Complete Frame-based Action Models
In recent years, attention-like mechanisms have been used to great success in the space of large language models, unlocking scaling potential to a previously unthinkable extent. "Attention Is All You Need" famously claims RNN cells are not needed in conjunction with attention. We challenge this view. In this paper, we point to existing proofs that architectures with fully parallelizable forward or backward passes cannot represent classes of problems specifically interesting for long-running agentic tasks. We further conjecture a critical time t beyond which non-recurrence-complete models fail to aggregate inputs correctly, with concrete implications for agentic systems (e.g., software engineering agents). To address this, we introduce a recurrence-complete architecture and train it on GitHub-derived action sequences. Loss follows a power law in the trained sequence length while the parameter count remains fixed. Moreover, longer-sequence training always amortizes its linearly increasing wall-time cost, yielding lower loss as a function of wall time.
Efficient numeracy in language models through single-token number embeddings
To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either limiting the numerical intuition of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision--language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image--text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
comment: preprint
☆ The Unreasonable Effectiveness of Randomized Representations in Online Continual Graph Learning
Catastrophic forgetting is one of the main obstacles for Online Continual Graph Learning (OCGL), where nodes arrive one by one, distribution drifts may occur at any time and offline training on task-specific subgraphs is not feasible. In this work, we explore a surprisingly simple yet highly effective approach for OCGL: we use a fixed, randomly initialized encoder to generate robust and expressive node embeddings by aggregating neighborhood information, training online only a lightweight classifier. By freezing the encoder, we eliminate drifts of the representation parameters, a key source of forgetting, obtaining embeddings that are both expressive and stable. When evaluated across several OCGL benchmarks, despite its simplicity and lack of memory buffer, this approach yields consistent gains over state-of-the-art methods, with surprising improvements of up to 30% and performance often approaching that of the joint offline-training upper bound. These results suggest that in OCGL, catastrophic forgetting can be minimized without complex replay or regularization by embracing architectural simplicity and stability.
☆ BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods
The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods-e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.
comment: The 8th BlackboxNLP Workshop (Shared Task), 6 pages
☆ Quantum Computing Methods for Malware Detection
In this paper, we explore the potential of quantum computing in enhancing malware detection through the application of Quantum Machine Learning (QML). Our main objective is to investigate the performance of the Quantum Support Vector Machine (QSVM) algorithm compared to SVM. A publicly available dataset containing raw binaries of Portable Executable (PE) files was used for the classification. The QSVM algorithm, incorporating quantum kernels through different feature maps, was implemented and evaluated on a local simulator within the Qiskit SDK and IBM quantum computers. Experimental results from simulators and quantum hardware provide insights into the behavior and performance of quantum computers, especially in handling large-scale computations for malware detection tasks. The work summarizes the practical experience with using quantum hardware via the Qiskit interfaces. We describe in detail the critical issues encountered, as well as the fixes that had to be developed and applied to the base code of the Qiskit Machine Learning library. These issues include missing transpilation of the circuits submitted to IBM Quantum systems and exceeding the maximum job size limit due to the submission of all the circuits in one job.
comment: 22 pages, 2 figures, 3 tables
☆ Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model's training data better reflects the attacked data's components. We empirically support this hypothesis across vision language model and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization, while RL finetuning and protracted reasoning are not critical. For example, increasing emphasis on defensive specifications via prompting lowers the success rate of gradient-based multimodal attacks on VLMs robustified by adversarial pretraining, but this same intervention provides no such benefit to not-robustified models. This correlation of inference-compute's robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Accordingly, we advise layering train-time and test-time defenses to obtain their synergistic benefit.
comment: 17 pages
☆ Modeling COVID-19 Dynamics in German States Using Physics-Informed Neural Networks
The COVID-19 pandemic has highlighted the need for quantitative modeling and analysis to understand real-world disease dynamics. In particular, post hoc analyses using compartmental models offer valuable insights into the effectiveness of public health interventions, such as vaccination strategies and containment policies. However, such compartmental models like SIR (Susceptible-Infectious-Recovered) often face limitations in directly incorporating noisy observational data. In this work, we employ Physics-Informed Neural Networks (PINNs) to solve the inverse problem of the SIR model using infection data from the Robert Koch Institute (RKI). Our main contribution is a fine-grained, spatio-temporal analysis of COVID-19 dynamics across all German federal states over a three-year period. We estimate state-specific transmission and recovery parameters and time-varying reproduction number (R_t) to track the pandemic progression. The results highlight strong variations in transmission behavior across regions, revealing correlations with vaccination uptake and temporal patterns associated with major pandemic phases. Our findings demonstrate the utility of PINNs in localized, long-term epidemiological modeling.
comment: 19 pages, 7 figures, 2 tables
☆ Function regression using the forward forward training and inferring paradigm
Function regression/approximation is a fundamental application of machine learning. Neural networks (NNs) can be easily trained for function regression using a sufficient number of neurons and epochs. The forward-forward learning algorithm is a novel approach for training neural networks without backpropagation, and is well suited for implementation in neuromorphic computing and physical analogs for neural networks. To the best of the authors' knowledge, the Forward Forward paradigm of training and inferencing NNs is currently only restricted to classification tasks. This paper introduces a new methodology for approximating functions (function regression) using the Forward-Forward algorithm. Furthermore, the paper evaluates the developed methodology on univariate and multivariate functions, and provides preliminary studies of extending the proposed Forward-Forward regression to Kolmogorov Arnold Networks, and Deep Physical Neural Networks.
comment: Keywords: Neural Networks, Forward Forward training, Function Regression, Physical Neural Networks, Analog Computing
☆ MultiCNKG: Integrating Cognitive Neuroscience, Gene, and Disease Knowledge Graphs Using Large Language Models
The advent of large language models (LLMs) has revolutionized the integration of knowledge graphs (KGs) in biomedical and cognitive sciences, overcoming limitations in traditional machine learning methods for capturing intricate semantic links among genes, diseases, and cognitive processes. We introduce MultiCNKG, an innovative framework that merges three key knowledge sources: the Cognitive Neuroscience Knowledge Graph (CNKG) with 2.9K nodes and 4.3K edges across 9 node types and 20 edge types; Gene Ontology (GO) featuring 43K nodes and 75K edges in 3 node types and 4 edge types; and Disease Ontology (DO) comprising 11.2K nodes and 8.8K edges with 1 node type and 2 edge types. Leveraging LLMs like GPT-4, we conduct entity alignment, semantic similarity computation, and graph augmentation to create a cohesive KG that interconnects genetic mechanisms, neurological disorders, and cognitive functions. The resulting MultiCNKG encompasses 6.9K nodes across 5 types (e.g., Genes, Diseases, Cognitive Processes) and 11.3K edges spanning 7 types (e.g., Causes, Associated with, Regulates), facilitating a multi-layered view from molecular to behavioral domains. Assessments using metrics such as precision (85.20%), recall (87.30%), coverage (92.18%), graph consistency (82.50%), novelty detection (40.28%), and expert validation (89.50%) affirm its robustness and coherence. Link prediction evaluations with models like TransE (MR: 391, MRR: 0.411) and RotatE (MR: 263, MRR: 0.395) show competitive performance against benchmarks like FB15k-237 and WN18RR. This KG advances applications in personalized medicine, cognitive disorder diagnostics, and hypothesis formulation in cognitive neuroscience.
☆ Incorporating Expert Knowledge into Bayesian Causal Discovery of Mixtures of Directed Acyclic Graphs
Bayesian causal discovery benefits from prior information elicited from domain experts, and in heterogeneous domains any prior knowledge would be badly needed. However, so far prior elicitation approaches have assumed a single causal graph and hence are not suited to heterogeneous domains. We propose a causal elicitation strategy for heterogeneous settings, based on Bayesian experimental design (BED) principles, and a variational mixture structure learning (VaMSL) method -- extending the earlier differentiable Bayesian structure learning (DiBS) method -- to iteratively infer mixtures of causal Bayesian networks (CBNs). We construct an informative graph prior incorporating elicited expert feedback in the inference of mixtures of CBNs. Our proposed method successfully produces a set of alternative causal models (mixture components or clusters), and achieves an improved structure learning performance on heterogeneous synthetic data when informed by a simulated expert. Finally, we demonstrate that our approach is capable of capturing complex distributions in a breast cancer database.
comment: 28 pages, 18 figures
☆ Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management
We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. In specific, it periodically compresses the tool using history by LLM-generated summaries that retain task-relevant information to keep a compact context while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors as well as summarization strategies in an end-to-end fashion. We instantiate this framework with \underline{SU}mmarization augmented \underline{P}olicy \underline{O}ptimization (\texttt{SUPO}), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that \texttt{SUPO} significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, \texttt{SUPO} can further improve the evaluation performance when scaling test-time maximum round of summarization beyond that of training time. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
☆ Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding them in external knowledge. However, its application in sensitive domains is limited by privacy risks. Existing private RAG methods typically rely on query-time differential privacy (DP), which requires repeated noise injection and leads to accumulated privacy loss. To address this issue, we propose DP-SynRAG, a framework that uses LLMs to generate differentially private synthetic RAG databases. Unlike prior methods, the synthetic text can be reused once created, thereby avoiding repeated noise injection and additional privacy costs. To preserve essential information for downstream RAG tasks, DP-SynRAG extends private prediction, which instructs LLMs to generate text that mimics subsampled database records in a DP manner. Experiments show that DP-SynRAG achieves superior performanec to the state-of-the-art private RAG systems while maintaining a fixed privacy budget, offering a scalable solution for privacy-preserving RAG.
comment: Under review
☆ Dual Goal Representations
In this work, we introduce dual goal representations for goal-conditioned reinforcement learning (GCRL). A dual goal representation characterizes a state by "the set of temporal distances from all other states"; in other words, it encodes a state through its relations to every other state, measured by temporal distance. This representation provides several appealing theoretical properties. First, it depends only on the intrinsic dynamics of the environment and is invariant to the original state representation. Second, it contains provably sufficient information to recover an optimal goal-reaching policy, while being able to filter out exogenous noise. Based on this concept, we develop a practical goal representation learning method that can be combined with any existing GCRL algorithm. Through diverse experiments on the OGBench task suite, we empirically show that dual goal representations consistently improve offline goal-reaching performance across 20 state- and pixel-based tasks.
☆ Inefficiencies of Meta Agents for Agent Design
Recent works began to automate the design of agentic systems using meta-agents that propose and iteratively refine new agent architectures. In this paper, we examine three key challenges in a common class of meta-agents. First, we investigate how a meta-agent learns across iterations and find that simply expanding the context with all previous agents, as proposed by previous works, performs worse than ignoring prior designs entirely. We show that the performance improves with an evolutionary approach. Second, although the meta-agent designs multiple agents during training, it typically commits to a single agent at test time. We find that the designed agents have low behavioral diversity, limiting the potential for their complementary use. Third, we assess when automated design is economically viable. We find that only in a few cases--specifically, two datasets--the overall cost of designing and deploying the agents is lower than that of human-designed agents when deployed on over 15,000 examples. In contrast, the performance gains for other datasets do not justify the design cost, regardless of scale.
☆ A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking NeurIPS 2025
Generating realistic time series data is critical for applications in healthcare, finance, and science. However, irregular sampling and missing values present significant challenges. While prior methods address these irregularities, they often yield suboptimal results and incur high computational costs. Recent advances in regular time series generation, such as the diffusion-based ImagenTime model, demonstrate strong, fast, and scalable generative capabilities by transforming time series into image representations, making them a promising solution. However, extending ImagenTime to irregular sequences using simple masking introduces "unnatural" neighborhoods, where missing values replaced by zeros disrupt the learning process. To overcome this, we propose a novel two-step framework: first, a Time Series Transformer completes irregular sequences, creating natural neighborhoods; second, a vision-based diffusion model with masking minimizes dependence on the completed values. This approach leverages the strengths of both completion and masking, enabling robust and efficient generation of realistic time series. Our method achieves state-of-the-art performance, achieving a relative improvement in discriminative score by $70\%$ and in computational cost by $85\%$. Code is at https://github.com/azencot-group/ImagenI2R.
comment: Accepted to NeurIPS 2025; The first two authors contributed equally and are co-leading authors
☆ Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks
In recent years, the growing interest in Large Language Models (LLMs) has significantly advanced prompt engineering, transitioning from manual design to model-based optimization. Prompts for LLMs generally comprise two components: the \textit{instruction}, which defines the task or objective, and the \textit{input}, which is tailored to the instruction type. In natural language generation (NLG) tasks such as machine translation, the \textit{input} component is particularly critical, while the \textit{instruction} component tends to be concise. Existing prompt engineering methods primarily focus on optimizing the \textit{instruction} component for general tasks, often requiring large-parameter LLMs as auxiliary tools. However, these approaches exhibit limited applicability for tasks like machine translation, where the \textit{input} component plays a more pivotal role. To address this limitation, this paper introduces a novel prompt optimization method specifically designed for machine translation tasks. The proposed approach employs a small-parameter model trained using a back-translation-based strategy, significantly reducing training overhead for single-task optimization while delivering highly effective performance. With certain adaptations, this method can also be extended to other downstream tasks.
☆ Is the Hard-Label Cryptanalytic Model Extraction Really Polynomial?
Deep Neural Networks (DNNs) have attracted significant attention, and their internal models are now considered valuable intellectual assets. Extracting these internal models through access to a DNN is conceptually similar to extracting a secret key via oracle access to a block cipher. Consequently, cryptanalytic techniques, particularly differential-like attacks, have been actively explored recently. ReLU-based DNNs are the most commonly and widely deployed architectures. While early works (e.g., Crypto 2020, Eurocrypt 2024) assume access to exact output logits, which are usually invisible, more recent works (e.g., Asiacrypt 2024, Eurocrypt 2025) focus on the hard-label setting, where only the final classification result (e.g., "dog" or "car") is available to the attacker. Notably, Carlini et al. (Eurocrypt 2025) demonstrated that model extraction is feasible in polynomial time even under this restricted setting. In this paper, we first show that the assumptions underlying their attack become increasingly unrealistic as the attack-target depth grows. In practice, satisfying these assumptions requires an exponential number of queries with respect to the attack depth, implying that the attack does not always run in polynomial time. To address this critical limitation, we propose a novel attack method called CrossLayer Extraction. Instead of directly extracting the secret parameters (e.g., weights and biases) of a specific neuron, which incurs exponential cost, we exploit neuron interactions across layers to extract this information from deeper layers. This technique significantly reduces query complexity and mitigates the limitations of existing model extraction approaches.
☆ Latent Representation Learning in Heavy-Ion Collisions with MaskPoint Transformer NeurIPS 2025
A central challenge in high-energy nuclear physics is to extract informative features from the high-dimensional final-state data of heavy-ion collisions (HIC) in order to enable reliable downstream analyses. Traditional approaches often rely on selected observables, which may miss subtle but physically relevant structures in the data. To address this, we introduce a Transformer-based autoencoder trained with a two-stage paradigm: self-supervised pre-training followed by supervised fine-tuning. The pretrained encoder learns latent representations directly from unlabeled HIC data, providing a compact and information-rich feature space that can be adapted to diverse physics tasks. As a case study, we apply the method to distinguish between large and small collision systems, where it achieves significantly higher classification accuracy than PointNet. Principal component analysis and SHAP interpretation further demonstrate that the autoencoder captures complex nonlinear correlations beyond individual observables, yielding features with strong discriminative and explanatory power. These results establish our two-stage framework as a general and robust foundation for feature learning in HIC, opening the door to more powerful analyses of quark--gluon plasma properties and other emergent phenomena. The implementation is publicly available at https://github.com/Giovanni-Sforza/MaskPoint-AMPT.
comment: 10 pages, 5 figures, accepted at the NeurIPS 2025 workshop "Machine Learning and the Physical Sciences"
Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko-Pastur law, which has been believed in previous work. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.
☆ AutoBalance: An Automatic Balancing Framework for Training Physics-Informed Neural Networks
Physics-Informed Neural Networks (PINNs) provide a powerful and general framework for solving Partial Differential Equations (PDEs) by embedding physical laws into loss functions. However, training PINNs is notoriously difficult due to the need to balance multiple loss terms, such as PDE residuals and boundary conditions, which often have conflicting objectives and vastly different curvatures. Existing methods address this issue by manipulating gradients before optimization (a "pre-combine" strategy). We argue that this approach is fundamentally limited, as forcing a single optimizer to process gradients from spectrally heterogeneous loss landscapes disrupts its internal preconditioning. In this work, we introduce AutoBalance, a novel "post-combine" training paradigm. AutoBalance assigns an independent adaptive optimizer to each loss component and aggregates the resulting preconditioned updates afterwards. Extensive experiments on challenging PDE benchmarks show that AutoBalance consistently outperforms existing frameworks, achieving significant reductions in solution error, as measured by both the MSE and $L^{\infty}$ norms. Moreover, AutoBalance is orthogonal to and complementary with other popular PINN methodologies, amplifying their effectiveness on demanding benchmarks.
comment: 23 pages
☆ Distributed Algorithms for Multi-Agent Multi-Armed Bandits with Collision
We study the stochastic Multiplayer Multi-Armed Bandit (MMAB) problem, where multiple players select arms to maximize their cumulative rewards. Collisions occur when two or more players select the same arm, resulting in no reward, and are observed by the players involved. We consider a distributed setting without central coordination, where each player can only observe their own actions and collision feedback. We propose a distributed algorithm with an adaptive, efficient communication protocol. The algorithm achieves near-optimal group and individual regret, with a communication cost of only $\mathcal{O}(\log\log T)$. Our experiments demonstrate significant performance improvements over existing baselines. Compared to state-of-the-art (SOTA) methods, our approach achieves a notable reduction in individual regret. Finally, we extend our approach to a periodic asynchronous setting, proving the lower bound for this problem and presenting an algorithm that achieves logarithmic regret.
comment: 21 pages, 4 figures
☆ TimeFormer: Transformer with Attention Modulation Empowered by Temporal Characteristics for Time Series Forecasting
Although Transformers excel in natural language processing, their extension to time series forecasting remains challenging due to insufficient consideration of the differences between textual and temporal modalities. In this paper, we develop a novel Transformer architecture designed for time series data, aiming to maximize its representational capacity. We identify two key but often overlooked characteristics of time series: (1) unidirectional influence from the past to the future, and (2) the phenomenon of decaying influence over time. These characteristics are introduced to enhance the attention mechanism of Transformers. We propose TimeFormer, whose core innovation is a self-attention mechanism with two modulation terms (MoSA), designed to capture these temporal priors of time series under the constraints of the Hawkes process and causal masking. Additionally, TimeFormer introduces a framework based on multi-scale and subsequence analysis to capture semantic dependencies at different temporal scales, enriching the temporal dependencies. Extensive experiments conducted on multiple real-world datasets show that TimeFormer significantly outperforms state-of-the-art methods, achieving up to a 7.45% reduction in MSE compared to the best baseline and setting new benchmarks on 94.04\% of evaluation metrics. Moreover, we demonstrate that the MoSA mechanism can be broadly applied to enhance the performance of other Transformer-based models.
☆ Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback
We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents' context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online notes generation and regularly inform offline model retraining, closing the agent edits feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.
comment: Accepted at EMNLP 2025 Industry Track
☆ XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation
Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and relying heavily on sparse rewards. This paper presents XRPO(eXplore - eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advances (e.g., GRPO and GSPO) up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7X.
☆ The Effect of Attention Head Count on Transformer Approximation
Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $O(1/\epsilon^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, where approximation is entirely achieved by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.
☆ Rethinking Nonlinearity: Trainable Gaussian Mixture Modules for Modern Neural Architectures
Neural networks in general, from MLPs and CNNs to attention-based Transformers, are constructed from layers of linear combinations followed by nonlinear operations such as ReLU, Sigmoid, or Softmax. Despite their strength, these conventional designs are often limited in introducing non-linearity by the choice of activation functions. In this work, we introduce Gaussian Mixture-Inspired Nonlinear Modules (GMNM), a new class of differentiable modules that draw on the universal density approximation Gaussian mixture models (GMMs) and distance properties (metric space) of Gaussian kernal. By relaxing probabilistic constraints and adopting a flexible parameterization of Gaussian projections, GMNM can be seamlessly integrated into diverse neural architectures and trained end-to-end with gradient-based methods. Our experiments demonstrate that incorporating GMNM into architectures such as MLPs, CNNs, attention mechanisms, and LSTMs consistently improves performance over standard baselines. These results highlight GMNM's potential as a powerful and flexible module for enhancing efficiency and accuracy across a wide range of machine learning applications.
☆ Fitzpatrick Thresholding for Skin Image Segmentation MICCAI 2025
Accurate estimation of the body surface area (BSA) involved by a rash, such as psoriasis, is critical for assessing rash severity, selecting an initial treatment regimen, and following clinical treatment response. Attempts at segmentation of inflammatory skin disease such as psoriasis perform markedly worse on darker skin tones, potentially impeding equitable care. We assembled a psoriasis dataset sourced from six public atlases, annotated for Fitzpatrick skin type, and added detailed segmentation masks for every image. Reference models based on U-Net, ResU-Net, and SETR-small are trained without tone information. On the tuning split we sweep decision thresholds and select (i) global optima and (ii) per Fitzpatrick skin tone optima for Dice and binary IoU. Adapting Fitzpatrick specific thresholds lifted segmentation performance for the darkest subgroup (Fitz VI) by up to +31 % bIoU and +24 % Dice on UNet, with consistent, though smaller, gains in the same direction for ResU-Net (+25 % bIoU, +18 % Dice) and SETR-small (+17 % bIoU, +11 % Dice). Because Fitzpatrick skin tone classifiers trained on Fitzpatrick-17k now exceed 95 % accuracy, the cost of skin tone labeling required for this technique has fallen dramatically. Fitzpatrick thresholding is simple, model-agnostic, requires no architectural changes, no re-training, and is virtually cost free. We demonstrate the inclusion of Fitzpatrick thresholding as a potential future fairness baseline.
comment: Accepted to MICCAI 2025 ISIC Workshop. 24 minute Oral presentation given. Awarded "Best Paper - Honorable Mention"
Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions
The Forward-Forward (FF) Algorithm is a recently proposed learning procedure for neural networks that employs two forward passes instead of the traditional forward and backward passes used in backpropagation. However, FF remains largely confined to supervised settings, leaving a gap at domains where learning signals can be yielded more naturally such as RL. In this work, inspired by FF's goodness function using layer activity statistics, we introduce Action-conditioned Root mean squared Q-Functions (ARQ), a novel value estimation method that applies a goodness function and action conditioning for local RL using temporal difference learning. Despite its simplicity and biological grounding, our approach achieves superior performance compared to state-of-the-art local backprop-free RL methods in the MinAtar and the DeepMind Control Suite benchmarks, while also outperforming algorithms trained with backpropagation on most tasks. Code can be found at https://github.com/agentic-learning-ai-lab/arq.
comment: 15 pages, 5 figures
☆ Q-Learning with Fine-Grained Gap-Dependent Regret
We study fine-grained gap-dependent regret bounds for model-free reinforcement learning in episodic tabular Markov Decision Processes. Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps. We address this limitation by establishing fine-grained gap-dependent regret bounds for both UCB-based and non-UCB-based algorithms. In the UCB-based setting, we develop a novel analytical framework that explicitly separates the analysis of optimal and suboptimal state-action pairs, yielding the first fine-grained regret upper bound for UCB-Hoeffding (Jin et al., 2018). To highlight the generality of this framework, we introduce ULCB-Hoeffding, a new UCB-based algorithm inspired by AMB (Xu et al.,2021) but with a simplified structure, which enjoys fine-grained regret guarantees and empirically outperforms AMB. In the non-UCB-based setting, we revisit the only known algorithm AMB, and identify two key issues in its algorithm design and analysis: improper truncation in the $Q$-updates and violation of the martingale difference condition in its concentration argument. We propose a refined version of AMB that addresses these issues, establishing the first rigorous fine-grained gap-dependent regret for a non-UCB-based method, with experiments demonstrating improved performance over AMB.
☆ The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators
A core challenge in scientific machine learning, and scientific computing more generally, is modeling continuous phenomena which (in practice) are represented discretely. Machine-learned operators (MLOs) have been introduced as a means to achieve this modeling goal, as this class of architecture can perform inference at arbitrary resolution. In this work, we evaluate whether this architectural innovation is sufficient to perform "zero-shot super-resolution," namely to enable a model to serve inference on higher-resolution data than that on which it was originally trained. We comprehensively evaluate both zero-shot sub-resolution and super-resolution (i.e., multi-resolution) inference in MLOs. We decouple multi-resolution inference into two key behaviors: 1) extrapolation to varying frequency information; and 2) interpolating across varying resolutions. We empirically demonstrate that MLOs fail to do both of these tasks in a zero-shot manner. Consequently, we find MLOs are not able to perform accurate inference at resolutions different from those on which they were trained, and instead they are brittle and susceptible to aliasing. To address these failure modes, we propose a simple, computationally-efficient, and data-driven multi-resolution training protocol that overcomes aliasing and that provides robust multi-resolution generalization.
☆ A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures
State Space Models (SSMs) have recently emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing, offering linear scaling and lower memory use. Yet, how contextual information flows across layers and tokens in these architectures remains understudied. We present the first unified, token- and layer-level analysis of representation propagation in SSMs and TBMs. Using centered kernel alignment, stability metrics, and probing, we characterize how representations evolve within and across layers. We find a key divergence: TBMs rapidly homogenize token representations, with diversity reemerging only in later layers, while SSMs preserve token uniqueness early but converge to homogenization deeper. Theoretical analysis and parameter randomization further reveal that oversmoothing in TBMs stems from architectural design, whereas in SSMs it arises mainly from training dynamics. These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.
☆ Control-Augmented Autoregressive Diffusion for Data Assimilation
Despite recent advances in test-time scaling and finetuning of diffusion models, guidance in Auto-Regressive Diffusion Models (ARDMs) remains underexplored. We introduce an amortized framework that augments pretrained ARDMs with a lightweight controller network, trained offline by previewing future ARDM rollouts and learning stepwise controls that anticipate upcoming observations under a terminal cost objective. We evaluate this framework in the context of data assimilation (DA) for chaotic spatiotemporal partial differential equations (PDEs), a setting where existing methods are often computationally prohibitive and prone to forecast drift under sparse observations. Our approach reduces DA inference to a single forward rollout with on-the-fly corrections, avoiding expensive adjoint computations and/or optimizations during inference. We demonstrate that our method consistently outperforms four state-of-the-art baselines in stability, accuracy, and physical fidelity across two canonical PDEs and six observation regimes. We will release code and checkpoints publicly.
☆ StruSR: Structure-Aware Symbolic Regression with Physics-Informed Taylor Guidance
Symbolic regression aims to find interpretable analytical expressions by searching over mathematical formula spaces to capture underlying system behavior, particularly in scientific modeling governed by physical laws. However, traditional methods lack mechanisms for extracting structured physical priors from time series observations, making it difficult to capture symbolic expressions that reflect the system's global behavior. In this work, we propose a structure-aware symbolic regression framework, called StruSR, that leverages trained Physics-Informed Neural Networks (PINNs) to extract locally structured physical priors from time series data. By performing local Taylor expansions on the outputs of the trained PINN, we obtain derivative-based structural information to guide symbolic expression evolution. To assess the importance of expression components, we introduce a masking-based attribution mechanism that quantifies each subtree's contribution to structural alignment and physical residual reduction. These sensitivity scores steer mutation and crossover operations within genetic programming, preserving substructures with high physical or structural significance while selectively modifying less informative components. A hybrid fitness function jointly minimizes physics residuals and Taylor coefficient mismatch, ensuring consistency with both the governing equations and the local analytical behavior encoded by the PINN. Experiments on benchmark PDE systems demonstrate that StruSR improves convergence speed, structural fidelity, and expression interpretability compared to conventional baselines, offering a principled paradigm for physics-grounded symbolic discovery.
☆ Three Forms of Stochastic Injection for Improved Distribution-to-Distribution Generative Modeling
Modeling transformations between arbitrary data distributions is a fundamental scientific challenge, arising in applications like drug discovery and evolutionary simulation. While flow matching offers a natural framework for this task, its use has thus far primarily focused on the noise-to-data setting, while its application in the general distribution-to-distribution setting is underexplored. We find that in the latter case, where the source is also a data distribution to be learned from limited samples, standard flow matching fails due to sparse supervision. To address this, we propose a simple and computationally efficient method that injects stochasticity into the training process by perturbing source samples and flow interpolants. On five diverse imaging tasks spanning biology, radiology, and astronomy, our method significantly improves generation quality, outperforming existing baselines by an average of 9 FID points. Our approach also reduces the transport cost between input and generated samples to better highlight the true effect of the transformation, making flow matching a more practical tool for simulating the diverse distribution transformations that arise in science.
☆ Chem-NMF: Multi-layer $α$-divergence Non-Negative Matrix Factorization for Cardiorespiratory Disease Clustering, with Improved Convergence Inspired by Chemical Catalysts and Rigorous Asymptotic Analysis
Non-Negative Matrix Factorization (NMF) is an unsupervised learning method offering low-rank representations across various domains such as audio processing, biomedical signal analysis, and image recognition. The incorporation of $\alpha$-divergence in NMF formulations enhances flexibility in optimization, yet extending these methods to multi-layer architectures presents challenges in ensuring convergence. To address this, we introduce a novel approach inspired by the Boltzmann probability of the energy barriers in chemical reactions to theoretically perform convergence analysis. We introduce a novel method, called Chem-NMF, with a bounding factor which stabilizes convergence. To our knowledge, this is the first study to apply a physical chemistry perspective to rigorously analyze the convergence behaviour of the NMF algorithm. We start from mathematically proven asymptotic convergence results and then show how they apply to real data. Experimental results demonstrate that the proposed algorithm improves clustering accuracy by 5.6% $\pm$ 2.7% on biomedical signals and 11.1% $\pm$ 7.2% on face images (mean $\pm$ std).
☆ AI-Driven Forecasting and Monitoring of Urban Water System
Underground water and wastewater pipelines are vital for city operations but plagued by anomalies like leaks and infiltrations, causing substantial water loss, environmental damage, and high repair costs. Conventional manual inspections lack efficiency, while dense sensor deployments are prohibitively expensive. In recent years, artificial intelligence has advanced rapidly and is increasingly applied to urban infrastructure. In this research, we propose an integrated AI and remote-sensor framework to address the challenge of leak detection in underground water pipelines, through deploying a sparse set of remote sensors to capture real-time flow and depth data, paired with HydroNet - a dedicated model utilizing pipeline attributes (e.g., material, diameter, slope) in a directed graph for higher-precision modeling. Evaluations on a real-world campus wastewater network dataset demonstrate that our system collects effective spatio-temporal hydraulic data, enabling HydroNet to outperform advanced baselines. This integration of edge-aware message passing with hydraulic simulations enables accurate network-wide predictions from limited sensor deployments. We envision that this approach can be effectively extended to a wide range of underground water pipeline networks.
☆ Unsupervised Backdoor Detection and Mitigation for Spiking Neural Networks
Spiking Neural Networks (SNNs) have gained increasing attention for their superior energy efficiency compared to Artificial Neural Networks (ANNs). However, their security aspects, particularly under backdoor attacks, have received limited attention. Existing defense methods developed for ANNs perform poorly or can be easily bypassed in SNNs due to their event-driven and temporal dependencies. This paper identifies the key blockers that hinder traditional backdoor defenses in SNNs and proposes an unsupervised post-training detection framework, Temporal Membrane Potential Backdoor Detection (TMPBD), to overcome these challenges. TMPBD leverages the maximum margin statistics of temporal membrane potential (TMP) in the final spiking layer to detect target labels without any attack knowledge or data access. We further introduce a robust mitigation mechanism, Neural Dendrites Suppression Backdoor Mitigation (NDSBM), which clamps dendritic connections between early convolutional layers to suppress malicious neurons while preserving benign behaviors, guided by TMP extracted from a small, clean, unlabeled dataset. Extensive experiments on multiple neuromorphic benchmarks and state-of-the-art input-aware dynamic trigger attacks demonstrate that TMPBD achieves 100% detection accuracy, while NDSBM reduces the attack success rate from 100% to 8.44%, and to 2.81% when combined with detection, without degrading clean accuracy.
comment: To appear in The 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2025)
☆ POME: Post Optimization Model Edit via Muon-style Projection
We introduce Post-Optimization Model Edit (POME), a new algorithm that enhances the performance of fine-tuned large language models using only their pretrained and fine-tuned checkpoints, without requiring extra data or further optimization. The core idea is to apply a muon-style projection to $\Delta W$, the difference between the fine-tuned and pretrained weights. This projection uses truncated singular value decomposition (SVD) to equalize the influence of dominant update directions and prune small singular values, which often represent noise. As a simple post-processing step, POME is completely decoupled from the training pipeline. It requires zero modifications and imposes no overhead, making it universally compatible with any optimizer or distributed framework. POME delivers consistent gains, boosting average performance by +2.5\% on GSM8K and +1.0\% on code generation. Its broad applicability -- from 7B foundation models to 72B RLHF-instructed models -- establishes it as a practical, zero-cost enhancement for any fine-tuning pipeline. Code is available at https://github.com/NUS-HPC-AI-Lab/POME.
☆ DPA-Net: A Dual-Path Attention Neural Network for Inferring Glycemic Control Metrics from Self-Monitored Blood Glucose Data
Continuous glucose monitoring (CGM) provides dense and dynamic glucose profiles that enable reliable estimation of Ambulatory Glucose Profile (AGP) metrics, such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR). However, the high cost and limited accessibility of CGM restrict its widespread adoption, particularly in low- and middle-income regions. In contrast, self-monitoring of blood glucose (SMBG) is inexpensive and widely available but yields sparse and irregular data that are challenging to translate into clinically meaningful glycemic metrics. In this work, we propose a Dual-Path Attention Neural Network (DPA-Net) to estimate AGP metrics directly from SMBG data. DPA-Net integrates two complementary paths: (1) a spatial-channel attention path that reconstructs a CGM-like trajectory from sparse SMBG observations, and (2) a multi-scale ResNet path that directly predicts AGP metrics. An alignment mechanism between the two paths is introduced to reduce bias and mitigate overfitting. In addition, we develop an active point selector to identify realistic and informative SMBG sampling points that reflect patient behavioral patterns. Experimental results on a large, real-world dataset demonstrate that DPA-Net achieves robust accuracy with low errors while reducing systematic bias. To the best of our knowledge, this is the first supervised machine learning framework for estimating AGP metrics from SMBG data, offering a practical and clinically relevant decision-support tool in settings where CGM is not accessible.
comment: 14 pages, 10 figures
☆ FEAorta: A Fully Automated Framework for Finite Element Analysis of the Aorta From 3D CT Images
Aortic aneurysm disease ranks consistently in the top 20 causes of death in the U.S. population. Thoracic aortic aneurysm is manifested as an abnormal bulging of thoracic aortic wall and it is a leading cause of death in adults. From the perspective of biomechanics, rupture occurs when the stress acting on the aortic wall exceeds the wall strength. Wall stress distribution can be obtained by computational biomechanical analyses, especially structural Finite Element Analysis. For risk assessment, probabilistic rupture risk of TAA can be calculated by comparing stress with material strength using a material failure model. Although these engineering tools are currently available for TAA rupture risk assessment on patient specific level, clinical adoption has been limited due to two major barriers: labor intensive 3D reconstruction current patient specific anatomical modeling still relies on manual segmentation, making it time consuming and difficult to scale to a large patient population, and computational burden traditional FEA simulations are resource intensive and incompatible with time sensitive clinical workflows. The second barrier was successfully overcome by our team through the development of the PyTorch FEA library and the FEA DNN integration framework. By incorporating the FEA functionalities within PyTorch FEA and applying the principle of static determinacy, we reduced the FEA based stress computation time to approximately three minutes per case. Moreover, by integrating DNN and FEA through the PyTorch FEA library, our approach further decreases the computation time to only a few seconds per case. This work focuses on overcoming the first barrier through the development of an end to end deep neural network capable of generating patient specific finite element meshes of the aorta directly from 3D CT images.
☆ SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation
The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM
☆ The Framework That Survives Bad Models: Human-AI Collaboration For Clinical Trials
Artificial intelligence (AI) holds great promise for supporting clinical trials, from patient recruitment and endpoint assessment to treatment response prediction. However, deploying AI without safeguards poses significant risks, particularly when evaluating patient endpoints that directly impact trial conclusions. We compared two AI frameworks against human-only assessment for medical image-based disease evaluation, measuring cost, accuracy, robustness, and generalization ability. To stress-test these frameworks, we injected bad models, ranging from random guesses to naive predictions, to ensure that observed treatment effects remain valid even under severe model degradation. We evaluated the frameworks using two randomized controlled trials with endpoints derived from spinal X-ray images. Our findings indicate that using AI as a supporting reader (AI-SR) is the most suitable approach for clinical trials, as it meets all criteria across various model types, even with bad models. This method consistently provides reliable disease estimation, preserves clinical trial treatment effect estimates and conclusions, and retains these advantages when applied to different populations.
☆ Adapting Quantum Machine Learning for Energy Dissociation of Bonds
Accurate prediction of bond dissociation energies (BDEs) underpins mechanistic insight and the rational design of molecules and materials. We present a systematic, reproducible benchmark comparing quantum and classical machine learning models for BDE prediction using a chemically curated feature set encompassing atomic properties (atomic numbers, hybridization), bond characteristics (bond order, type), and local environmental descriptors. Our quantum framework, implemented in Qiskit Aer on six qubits, employs ZZFeatureMap encodings with variational ansatz (RealAmplitudes) across multiple architectures Variational Quantum Regressors (VQR), Quantum Support Vector Regressors (QSVR), Quantum Neural Networks (QNN), Quantum Convolutional Neural Networks (QCNN), and Quantum Random Forests (QRF). These are rigorously benchmarked against strong classical baselines, including Support Vector Regression (SVR), Random Forests (RF), and Multi-Layer Perceptrons (MLP). Comprehensive evaluation spanning absolute and relative error metrics, threshold accuracies, and error distributions shows that top-performing quantum models (QCNN, QRF) match the predictive accuracy and robustness of classical ensembles and deep networks, particularly within the chemically prevalent mid-range BDE regime. These findings establish a transparent baseline for quantum-enhanced molecular property prediction and outline a practical foundation for advancing quantum computational chemistry toward near chemical accuracy.
☆ The Markovian Thinker
Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
☆ From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining NeurIPS 2025
Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: The scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. Such saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
comment: 22 pages, 11 figures, an abridged version to appear in NeurIPS 2025 LLM Evaluation Workshop
☆ Incoherence in goal-conditioned autoregressive models
We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterize the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon.
☆ Cluster Paths: Navigating Interpretability in Neural Networks
While modern deep neural networks achieve impressive performance in vision tasks, they remain opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures. We propose cluster paths, a post-hoc interpretability method that clusters activations at selected layers and represents each input as its sequence of cluster IDs. To assess these cluster paths, we introduce four metrics: path complexity (cognitive load), weighted-path purity (class alignment), decision-alignment faithfulness (predictive fidelity), and path agreement (stability under perturbations). In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts and collapse when the cue is removed. On a five-class CelebA hair-color task, they achieve 90% faithfulness and maintain 96% agreement under Gaussian noise without sacrificing accuracy. Scaling to a Vision Transformer pretrained on ImageNet, we extend cluster paths to concept paths derived from prompting a large language model on minimal path divergences. Finally, we show that cluster paths can serve as an effective out-of-distribution (OOD) detector, reliably flagging anomalous samples before the model generates over-confident predictions. Cluster paths uncover visual concepts, such as color palettes, textures, or object contexts, at multiple network depths, demonstrating that cluster paths scale to large vision models while generating concise and human-readable explanations.
☆ Scalable Policy-Based RL Algorithms for POMDPs NeurIPS 2025
The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP model into a finite-state Markov Decision Process (MDP) (called Superstate MDP). We first derive theoretical guarantees that improve upon prior work that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by Policy Optimization by treating it as an MDP, where the MDP state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
comment: 36 pages, 3 Figures, Accepted at NeurIPS 2025
☆ Auto-Prompt Ensemble for LLM Judge
We present a novel framework that improves the reliability of LLM judges by selectively augmenting LLM with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit standards underlying human assessments. To address this challenge, we propose the Auto-Prompt Ensemble (APE), an adaptive framework that automatically learns evaluation dimensions from its failure cases. APE incorporates a confidence-based ensemble mechanism to decide when to adopt the judgments from additional evaluation dimensions through a novel confidence estimation approach called Collective Confidence. Extensive experiments demonstrate that APE improves the reliability of LLM Judge across diverse standard benchmarks. For instance, APE enhances GPT-4o agreement rate on Reward Bench from 87.2% to 90.5% in the zero-shot setting. Overall, APE provides a principled approach for LLM Judge to leverage test-time computation, and bridge the evaluation gap between human and LLM judges.
☆ Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them
Agentic search leverages large language models (LLMs) to interpret complex user information needs and execute a multi-step process of planning, searching, and synthesizing information to provide answers. This paradigm introduces unique challenges for LLMs' reasoning and agentic capabilities when interacting with retrieval systems and the broader web. In this paper, we propose a reasoning-driven LLM-based pipeline to study effective reasoning behavior patterns in agentic search. Using this pipeline, we analyze successful agentic search trajectories and identify four beneficial reasoning behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Based on these findings, we propose a technique called Behavior Priming to train more effective agentic search models. It synthesizes agentic search trajectories that exhibit these four behaviors and integrates them into the agentic search model through supervised fine-tuning (SFT), followed by standard reinforcement learning (RL). Experiments on three benchmarks (GAIA, WebWalker, and HLE) demonstrate that behavior priming yields over 35% gains in Llama3.2-3B and Qwen3-1.7B compared to directly training agentic search models with RL. Crucially, we demonstrate that the desired reasoning behaviors in the SFT data, rather than the correctness of the final answer, is the critical factor for achieving strong final performance after RL: fine-tuning on trajectories with desirable reasoning behaviors but incorrect answers leads to better performance than fine-tuning on trajectories with correct answers. Our analysis further reveals the underlying mechanism: the introduced reasoning behaviors endow models with more effective exploration (higher pass@k and entropy) and test-time scaling (longer trajectories) capabilities, providing a strong foundation for RL. Our code will be released as open source.
☆ From Description to Detection: LLM based Extendable O-RAN Compliant Blind DoS Detection in 5G and Beyond
The quality and experience of mobile communication have significantly improved with the introduction of 5G, and these improvements are expected to continue beyond the 5G era. However, vulnerabilities in control-plane protocols, such as Radio Resource Control (RRC) and Non-Access Stratum (NAS), pose significant security threats, such as Blind Denial of Service (DoS) attacks. Despite the availability of existing anomaly detection methods that leverage rule-based systems or traditional machine learning methods, these methods have several limitations, including the need for extensive training data, predefined rules, and limited explainability. Addressing these challenges, we propose a novel anomaly detection framework that leverages the capabilities of Large Language Models (LLMs) in zero-shot mode with unordered data and short natural language attack descriptions within the Open Radio Access Network (O-RAN) architecture. We analyse robustness to prompt variation, demonstrate the practicality of automating the attack descriptions and show that detection quality relies on the semantic completeness of the description rather than its phrasing or length. We utilise an RRC/NAS dataset to evaluate the solution and provide an extensive comparison of open-source and proprietary LLM implementations to demonstrate superior performance in attack detection. We further validate the practicality of our framework within O-RAN's real-time constraints, illustrating its potential for detecting other Layer-3 attacks.
☆ BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music
Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music analytical practices. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of POP909 dataset with tempo-aligned content and human-corrected labels of chords, beats, keys, and time signatures; and (2) We propose BACHI, a symbolic chord recognition model that decomposes the task into different decision steps, namely boundary detection and iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors the human ear-training practices. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.
comment: Under review
☆ Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture
We establish that randomly initialized neural networks, with large width and a natural choice of hyperparameters, have nearly independent outputs exactly when their activation function is nonlinear with zero mean under the Gaussian measure: $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[\sigma(z)]=0$. For example, this includes ReLU and GeLU with an additive shift, as well as tanh, but not ReLU or GeLU by themselves. Because of their nearly independent outputs, we propose neural networks with zero-mean activation functions as a promising candidate for the Alignment Research Center's computational no-coincidence conjecture -- a conjecture that aims to measure the limits of AI interpretability.
☆ Pseudo-MDPs: A Novel Framework for Efficiently Optimizing Last Revealer Seed Manipulations in Blockchains
This study tackles the computational challenges of solving Markov Decision Processes (MDPs) for a restricted class of problems. It is motivated by the Last Revealer Attack (LRA), which undermines fairness in some Proof-of-Stake (PoS) blockchains such as Ethereum (\$400B market capitalization). We introduce pseudo-MDPs (pMDPs) a framework that naturally models such problems and propose two distinct problem reductions to standard MDPs. One problem reduction provides a novel, counter-intuitive perspective, and combining the two problem reductions enables significant improvements in dynamic programming algorithms such as value iteration. In the case of the LRA which size is parameterized by $\kappa$ (in Ethereum's case $\kappa$= 325), we reduce the computational complexity from $O(2^\kappa \kappa^{2^{\kappa+2}})$ to $O(\kappa^4)$ (per iteration). This solution also provide the usual benefits from Dynamic Programming solutions: exponentially fast convergence toward the optimal solution is guaranteed. The dual perspective also simplifies policy extraction, making the approach well-suited for resource-constrained agents who can operate with very limited memory and computation once the problem has been solved. Furthermore, we generalize those results to a broader class of MDPs, enhancing their applicability. The framework is validated through two case studies: a fictional card game and the LRA on the Ethereum random seed consensus protocol. These applications demonstrate the framework's ability to solve large-scale problems effectively while offering actionable insights into optimal strategies. This work advances the study of MDPs and contributes to understanding security vulnerabilities in blockchain systems.
☆ LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics
Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model's actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.
☆ From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work have highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .
☆ Retentive Relevance: Capturing Long-Term User Value in Recommendation Systems
Recommendation systems have traditionally relied on short-term engagement signals, such as clicks and likes, to personalize content. However, these signals are often noisy, sparse, and insufficient for capturing long-term user satisfaction and retention. We introduce Retentive Relevance, a novel content-level survey-based feedback measure that directly assesses users' intent to return to the platform for similar content. Unlike other survey measures that focus on immediate satisfaction, Retentive Relevance targets forward-looking behavioral intentions, capturing longer term user intentions and providing a stronger predictor of retention. We validate Retentive Relevance using psychometric methods, establishing its convergent, discriminant, and behavioral validity. Through large-scale offline modeling, we show that Retentive Relevance significantly outperforms both engagement signals and other survey measures in predicting next-day retention, especially for users with limited historical engagement. We develop a production-ready proxy model that integrates Retentive Relevance into the final stage of a multi-stage ranking system on a social media platform. Calibrated score adjustments based on this model yield substantial improvements in engagement, and retention, while reducing exposure to low-quality content, as demonstrated by large-scale A/B experiments. This work provides the first empirically validated framework linking content-level user perceptions to retention outcomes in production systems. We offer a scalable, user-centered solution that advances both platform growth and user experience. Our work has broad implications for responsible AI development.
☆ DGTEN: A Robust Deep Gaussian based Graph Neural Network for Dynamic Trust Evaluation with Uncertainty-Quantification Support
Dynamic trust evaluation in large, rapidly evolving graphs requires models that can capture changing relationships, express calibrated confidence, and resist adversarial manipulation. DGTEN (Deep Gaussian-based Trust Evaluation Network) introduces a unified graph framework that achieves all three by combining uncertainty-aware message passing, expressive temporal modeling, and built-in defenses against trust-targeted attacks. It represents nodes and edges as Gaussian distributions so that both semantic signals and epistemic uncertainty propagate through the graph neural network, enabling risk-aware trust decisions rather than overconfident guesses. To model how trust evolves, it employs hybrid Absolute-Gaussian-Hourglass (HAGH) positional encoding with Kolmogorov-Arnold network-based unbiased multi-head attention, followed by an ordinary differential equation (ODE)-based residual learning module to jointly capture abrupt shifts and smooth trends. Robust adaptive ensemble coefficient analysis prunes or down-weights suspicious interactions using complementary cosine and Jaccard similarity measures, mitigating reputation laundering, sabotage, and on/off attacks. On two signed Bitcoin trust networks, DGTEN delivers significant improvements: in single-timeslot prediction on Bitcoin-Alpha, it improves MCC by 10.77% over the best dynamic baseline; in the cold-start scenario, it achieves a 16.41% MCC gain - the largest across all tasks and datasets. Under adversarial on/off attacks, it surpasses the baseline by up to 11.63% MCC. These results validate the effectiveness of the unified DGTEN framework.
comment: 18 pages, 9 figures, 5 tables
☆ Transformer-Based Indirect Structural Health Monitoring of Rail Infrastructure with Attention-Driven Detection and Localization of Transient Defects
Indirect structural health monitoring (iSHM) for broken rail detection using onboard sensors presents a cost-effective paradigm for railway track assessment, yet reliably detecting small, transient anomalies (2-10 cm) remains a significant challenge due to complex vehicle dynamics, signal noise, and the scarcity of labeled data limiting supervised approaches. This study addresses these issues through unsupervised deep learning. We introduce an incremental synthetic data benchmark designed to systematically evaluate model robustness against progressively complex challenges like speed variations, multi-channel inputs, and realistic noise patterns encountered in iSHM. Using this benchmark, we evaluate several established unsupervised models alongside our proposed Attention-Focused Transformer. Our model employs a self-attention mechanism, trained via reconstruction but innovatively deriving anomaly scores primarily from deviations in learned attention weights, aiming for both effectiveness and computational efficiency. Benchmarking results reveal that while transformer-based models generally outperform others, all tested models exhibit significant vulnerability to high-frequency localized noise, identifying this as a critical bottleneck for practical deployment. Notably, our proposed model achieves accuracy comparable to the state-of-the-art solution while demonstrating better inference speed. This highlights the crucial need for enhanced noise robustness in future iSHM models and positions our more efficient attention-based approach as a promising foundation for developing practical onboard anomaly detection systems.
comment: Preprint presented at the 15th International Workshop on Structural Health Monitoring (IWSHM)
Locality-Sensitive Hashing-Based Efficient Point Transformer for Charged Particle Reconstruction NeurIPS 2025
Charged particle track reconstruction is a foundational task in collider experiments and the main computational bottleneck in particle reconstruction. Graph neural networks (GNNs) have shown strong performance for this problem, but costly graph construction, irregular computations, and random memory access patterns substantially limit their throughput. The recently proposed Hashing-based Efficient Point Transformer (HEPT) offers a theoretically guaranteed near-linear complexity for large point cloud processing via locality-sensitive hashing (LSH) in attention computations; however, its evaluations have largely focused on embedding quality, and the object condensation pipeline on which HEPT relies requires a post-hoc clustering step (e.g., DBScan) that can dominate runtime. In this work, we make two contributions. First, we present a unified, fair evaluation of physics tracking performance for HEPT and a representative GNN-based pipeline under the same dataset and metrics. Second, we introduce HEPTv2 by extending HEPT with a lightweight decoder that eliminates the clustering stage and directly predicts track assignments. This modification preserves HEPT's regular, hardware-friendly computations while enabling ultra-fast end-to-end inference. On the TrackML dataset, optimized HEPTv2 achieves approximately 28 ms per event on an A100 while maintaining competitive tracking efficiency. These results position HEPTv2 as a practical, scalable alternative to GNN-based pipelines for fast tracking.
comment: Accepted to NeurIPS 2025 Machine Learning and the Physical Sciences Workshop
☆ TGM: a Modular and Efficient Library for Machine Learning on Temporal Graphs
Well-designed open-source software drives progress in Machine Learning (ML) research. While static graph ML enjoys mature frameworks like PyTorch Geometric and DGL, ML for temporal graphs (TG), networks that evolve over time, lacks comparable infrastructure. Existing TG libraries are often tailored to specific architectures, hindering support for diverse models in this rapidly evolving field. Additionally, the divide between continuous- and discrete-time dynamic graph methods (CTDG and DTDG) limits direct comparisons and idea transfer. To address these gaps, we introduce Temporal Graph Modelling (TGM), a research-oriented library for ML on temporal graphs, the first to unify CTDG and DTDG approaches. TGM offers first-class support for dynamic node features, time-granularity conversions, and native handling of link-, node-, and graph-level tasks. Empirically, TGM achieves an average 7.8x speedup across multiple models, datasets, and tasks compared to the widely used DyGLib, and an average 175x speedup on graph discretization relative to available implementations. Beyond efficiency, we show in our experiments how TGM unlocks entirely new research possibilities by enabling dynamic graph property prediction and time-driven training paradigms, opening the door to questions previously impractical to study. TGM is available at https://github.com/tgm-team/tgm
comment: 21 pages, 5 figures, 14 tables
☆ Expanding the Action Space of LLMs to Reason Beyond Language
Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
☆ Linguistic Patterns in Pandemic-Related Content: A Comparative Analysis of COVID-19, Constraint, and Monkeypox Datasets
This study conducts a computational linguistic analysis of pandemic-related online discourse to examine how language distinguishes health misinformation from factual communication. Drawing on three corpora: COVID-19 false narratives (n = 7588), general COVID-19 content (n = 10700), and Monkeypox-related posts (n = 5787), we identify significant differences in readability, rhetorical markers, and persuasive language use. COVID-19 misinformation exhibited markedly lower readability scores and contained over twice the frequency of fear-related or persuasive terms compared to the other datasets. It also showed minimal use of exclamation marks, contrasting with the more emotive style of Monkeypox content. These patterns suggest that misinformation employs a deliberately complex rhetorical style embedded with emotional cues, a combination that may enhance its perceived credibility. Our findings contribute to the growing body of work on digital health misinformation by highlighting linguistic indicators that may aid detection efforts. They also inform public health messaging strategies and theoretical models of crisis communication in networked media environments. At the same time, the study acknowledges limitations, including reliance on traditional readability indices, use of a deliberately narrow persuasive lexicon, and reliance on static aggregate analysis. Future research should therefore incorporate longitudinal designs, broader emotion lexicons, and platform-sensitive approaches to strengthen robustness.
comment: 16 pages
☆ Accuracy, Memory Efficiency and Generalization: A Comparative Study on Liquid Neural Networks and Recurrent Neural Networks
This review aims to conduct a comparative analysis of liquid neural networks (LNNs) and traditional recurrent neural networks (RNNs) and their variants, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs). The core dimensions of the analysis include model accuracy, memory efficiency, and generalization ability. By systematically reviewing existing research, this paper explores the basic principles, mathematical models, key characteristics, and inherent challenges of these neural network architectures in processing sequential data. Research findings reveal that LNN, as an emerging, biologically inspired, continuous-time dynamic neural network, demonstrates significant potential in handling noisy, non-stationary data, and achieving out-of-distribution (OOD) generalization. Additionally, some LNN variants outperform traditional RNN in terms of parameter efficiency and computational speed. However, RNN remains a cornerstone in sequence modeling due to its mature ecosystem and successful applications across various tasks. This review identifies the commonalities and differences between LNNs and RNNs, summarizes their respective shortcomings and challenges, and points out valuable directions for future research, particularly emphasizing the importance of improving the scalability of LNNs to promote their application in broader and more complex scenarios.
comment: 13 pages, 12 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
♻ ☆ The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
comment: Preprint
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for table 3, random effect is the model version; v3 changelog: Fixed minor formatting issues in tables 2 and 3;
♻ ☆ Edit-Based Flow Matching for Temporal Point Processes
Temporal point processes (TPPs) are a fundamental tool for modeling event sequences in continuous time, but most existing approaches rely on autoregressive parameterizations that are limited by their sequential sampling. Recent non-autoregressive, diffusion-style models mitigate these issues by jointly interpolating between noise and data through event insertions and deletions in a discrete Markov chain. In this work, we generalize this perspective and introduce an Edit Flow process for TPPs that transports noise to data via insert, delete, and substitute edit operations. By learning the instantaneous edit rates within a continuous-time Markov chain framework, we attain a flexible and efficient model that effectively reduces the total number of necessary edit operations during generation. Empirical results demonstrate the generative flexibility of our unconditionally trained model in a wide range of unconditional and conditional generation tasks on benchmark TPPs.
♻ ☆ BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining
Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (\textbf{B}ileve\textbf{L} \textbf{I}nfluence \textbf{S}coring method for data \textbf{S}election): a lightweight data selection method that operates entirely \emph{from scratch}, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.
♻ ☆ CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment by value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, the feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method-Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critique by one pass only based on the correctness of the step itself, providing deterministic token-level credits to refine the tokens that were originally assigned identical rule-based rewards. To further enhance the accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones like Llama and Qwen models show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO can help the model to foster the learning of correct reasoning pathways leading to correct answers.
comment: Work in progress
♻ ☆ DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization
Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.
comment: Accepted to the NewSumm Workshop at EMNLP 2025
♻ ☆ Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs
We generalize the concept of M\"obius inversion and Shapley values to directed acyclic multigraphs and weighted versions thereof. We further allow value functions (games) and thus their M\"obius transforms (synergy function) and Shapley values to have values in any abelian group that is a module over a ring that contains the graph weights, e.g. vector-valued functions. To achieve this and overcome the obstruction that the classical axioms (linearity, efficiency, null player, symmetry) are not strong enough to uniquely determine Shapley values in this more general setting, we analyze Shapley values from two novel points of view: 1) We introduce projection operators that allow us to interpret Shapley values as the recursive projection and re-attribution of higher-order synergies to lower-order ones; 2) we propose a strengthening of the null player axiom and a localized symmetry axiom, namely the weak elements and flat hierarchy axioms. The former allows us to remove coalitions with vanishing synergy while preserving the rest of the hierarchical structure. The latter treats player-coalition bonds uniformly in the corner case of hierarchically flat graphs. Together with linearity these axioms already imply a unique explicit formula for the Shapley values, as well as classical properties like efficiency, null player, symmetry, and novel ones like the projection property. This whole framework then specializes to finite inclusion algebras, lattices, partial orders and mereologies, and also recovers certain previously known cases as corner cases, and presents others from a new perspective. The admission of general weighted directed acyclic multigraph structured hierarchies and vector-valued functions and Shapley values opens up the possibility for new analytic tools and application areas, like machine learning, language processing, explainable artificial intelligence, and many more.
comment: 43 pages, 2 figures
♻ ☆ Interpretable Clustering: A Survey
In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent. For convenient access and reference, an open repository organizes representative and emerging interpretable clustering methods under the taxonomy proposed in this survey, available at https://github.com/hulianyu/Awesome-Interpretable-Clustering
comment: 14 pages, 2 figures, 3 tables
♻ ☆ Empirical Comparison of Membership Inference Attacks in Deep Transfer Learning
With the emergence of powerful large-scale foundation models, the training paradigm is increasingly shifting from from-scratch training to transfer learning. This enables high utility training with small, domain-specific datasets typical in sensitive applications. Membership inference attacks (MIAs) provide an empirical estimate of the privacy leakage by machine learning models. Yet, prior assessments of MIAs against models fine-tuned with transfer learning rely on a small subset of possible attacks. We address this by comparing performance of diverse MIAs in transfer learning settings to help practitioners identify the most efficient attacks for privacy risk evaluation. We find that attack efficacy decreases with the increase in training data for score-based MIAs. We find that there is no one MIA which captures all privacy risks in models trained with transfer learning. While the Likelihood Ratio Attack (LiRA) demonstrates superior performance across most experimental scenarios, the Inverse Hessian Attack (IHA) proves to be more effective against models fine-tuned on PatchCamelyon dataset in high data regime.
comment: 30 pages, 13 figures, published in TMLR https://openreview.net/forum?id=UligTUCgdt
♻ ☆ Conditional Local Independence Testing for Itô processes with Applications to Dynamic Causal Discovery
Inferring causal relationships from dynamical systems is the central interest of many scientific inquiries. Conditional local independence, which describes whether the evolution of one process is influenced by another process given additional processes, is important for causal learning in such systems. In this paper, we propose a hypothesis test for conditional local independence in It\^o processes. Our test is grounded in the semimartingale decomposition of the It\^o process, with which we introduce a stochastic integral process that is a martingale under the null hypothesis. We then apply a test for the martingale property, quantifying potential deviation from local independence. The test statistics is estimated using the optimal filtering equation. We show the consistency of the estimation, thereby establishing the level and power of our test. Numerical verification and a real-world application to causal discovery in brain resting-state fMRIs are conducted.
comment: Preprint
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ Scalable In-context Ranking with Generative Models
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify relevant document(s). While it is effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows due to quadratic/super-linear scaling of attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs finetuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in-context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
♻ ☆ ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning
Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. While recent advances in general motion tracking (GMT) have enabled humanoids to reproduce diverse human motions, these policies lack the precision and object awareness required for loco-manipulation. To this end, we introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. First, a GMT policy, trained on large-scale human-only motion, serves as a task-agnostic base for generating human-like whole-body movements. An efficient but precise residual policy is then learned to refine the GMT outputs to improve locomotion and incorporate object interaction. To further facilitate efficient training, we design (i) a point-cloud-based object tracking reward for smoother optimization, (ii) a contact reward that encourages accurate humanoid body-object interactions, and (iii) a curriculum-based virtual object controller to stabilize early training. We evaluate ResMimic in both simulation and on a real Unitree G1 humanoid. Results show substantial gains in task success, training efficiency, and robustness over strong baselines. Videos are available at https://resmimic.github.io/ .
comment: 9 pages, 8 figures
♻ ☆ NdLinear: Preserving Multi-Dimensional Structure for Parameter-Efficient Neural Networks
In deep learning, processing multidimensional inputs (e.g., images, medical scans, and time series) is an important task that often requires flattening the inputs. We introduce $\mathit{NdLinear}$, a drop-in replacement for linear layers that operates directly on tensors, requiring no flattening. By applying transformations separately along each dimension, NdLinear preserves native data structure while achieving dramatic parameter reductions, often by orders of magnitude, with minimal memory overhead. We prove NdLinear maintains expressivity through structured Tucker decomposition while preserving VC-dimension scaling. Extensive experiments demonstrate NdLinear's capacity to achieve significant parameter reductions with substantial wall-clock efficiency gains and minimal memory overhead. For instance, our $\mathit{NdLinear-LoRA}$ matches or exceeds standard LoRA on language reasoning tasks using up to $9\times$ fewer parameters. Experiments across CNNs, RNNs, Transformers, and MLPs on vision, language, time-series, and tabular tasks consistently demonstrate NdLinear's efficiency gains. While excelling at axis-separable tasks, NdLinear has limitations with entangled spatial interactions. By processing data in its original N-dimensional form, NdLinear provides a theoretically grounded, practical component for building more efficient neural architectures.
comment: Code is available at https://github.com/ensemble-core/NdLinear
♻ ☆ Valid Inference with Imperfect Synthetic Data NeurIPS 2025
Predictions and generations from large language models are increasingly being explored as an aid in limited data regimes, such as in computational social science and human subjects research. While prior technical work has mainly explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (e.g., synthetic simulations), such as in responses to surveys. However, it remains unclear by what means practitioners can combine such data with real data and yet produce statistically valid conclusions upon them. In this paper, we introduce a new estimator based on generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address this challenge. Intriguingly, we find that interactions between the moment residuals of synthetic data and those of real data (i.e., when they are predictive of each other) can greatly improve estimates of the target parameter. We validate the finite-sample performance of our estimator across different tasks in computational social science applications, demonstrating large empirical gains.
comment: NeurIPS 2025
♻ ☆ Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents
Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agents tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna-Think, DDT uses a two-stage training process to first improve the agent's world modeling ability via objectives such as state prediction or critique generation, and then improve the agent's action via policy training. We evaluate our methods on OSWorld and WindowsAgentArena, and demonstrate that Dyna-Think improves the agent's in-domain and out-of-domain performance, achieving similar best-of-n performance compared to R1 while generating 2x less tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective to improve policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.
♻ ☆ SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models
Proteins play crucial roles in almost all biological processes. The advancement of deep learning has greatly accelerated the development of protein foundation models, leading to significant successes in protein understanding and design. However, the lack of systematic red-teaming for these models has raised serious concerns about their potential misuse, such as generating proteins with biological safety risks. This paper introduces SafeProtein, the first red-teaming framework designed for protein foundation models to the best of our knowledge. SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red-teaming methods and conduct tests on protein foundation models. We also curated SafeProtein-Bench, which includes a manually constructed red-teaming benchmark dataset and a comprehensive evaluation protocol. SafeProtein achieved continuous jailbreaks on state-of-the-art protein foundation models (up to 70% attack success rate for ESM3), revealing potential biological safety risks in current protein foundation models and providing insights for the development of robust security protection technologies for frontier models. The codes will be made publicly available at https://github.com/jigang-fan/SafeProtein.
♻ ☆ Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks
Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss--Newton update must be solved in the parameter space, incurring a prohibitive $O(n^3)$ time complexity, where $n$ is the number of network trainable weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size $m = \sum_{\gamma} N_{\gamma} d_{\gamma}$, where each residual class $\gamma$ (e.g. PDE interior, boundary, initial data) contributes $N_{\gamma}$ collocation points of output dimension $d_{\gamma}$. Building on this insight, we introduce \textit{Dual Natural Gradient Descent} (D-NGD). D-NGD computes the Gauss--Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest $m$ and a Nystrom-preconditioned conjugate-gradient solver for larger $m$. Experimentally, D-NGD scales second-order PINN optimization to networks with up to 12.8 million parameters, delivers one- to three-order-of-magnitude lower final error $L^2$ than first-order methods (Adam, SGD) and quasi-Newton methods, and -- crucially -- enables natural-gradient training of PINNs at this scale on a single GPU.
♻ ☆ Lossy Neural Compression for Geospatial Analytics: A Review
Over the past decades, there has been an explosion in the amount of available Earth Observation (EO) data. The unprecedented coverage of the Earth's surface and atmosphere by satellite imagery has resulted in large volumes of data that must be transmitted to ground stations, stored in data centers, and distributed to end users. Modern Earth System Models (ESMs) face similar challenges, operating at high spatial and temporal resolutions, producing petabytes of data per simulated day. Data compression has gained relevance over the past decade, with neural compression (NC) emerging from deep learning and information theory, making EO data and ESM outputs ideal candidates due to their abundance of unlabeled data. In this review, we outline recent developments in NC applied to geospatial data. We introduce the fundamental concepts of NC including seminal works in its traditional applications to image and video compression domains with focus on lossy compression. We discuss the unique characteristics of EO and ESM data, contrasting them with "natural images", and explain the additional challenges and opportunities they present. Moreover, we review current applications of NC across various EO modalities and explore the limited efforts in ESM compression to date. The advent of self-supervised learning (SSL) and foundation models (FM) has advanced methods to efficiently distill representations from vast unlabeled data. We connect these developments to NC for EO, highlighting the similarities between the two fields and elaborate on the potential of transferring compressed feature representations for machine--to--machine communication. Based on insights drawn from this review, we devise future directions relevant to applications in EO and ESM.
comment: self-consistent review paper
♻ ☆ AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science. Code is at https://github.com/innovatingAI/AutoMind.
comment: Ongoing work
♻ ☆ Prefilled responses enhance zero-shot detection of AI-generated images
As AI models generate increasingly realistic images, growing concerns over potential misuse underscore the need for reliable detection. Traditional supervised detection methods depend on large, curated datasets for training and often fail to generalize to novel, out-of-domain image generators. As an alternative, we explore pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. We evaluate VLM performance on three diverse benchmarks encompassing synthetic images of human faces, objects, and animals produced by 16 different state-of-the-art image generators. While off-the-shelf VLMs perform poorly on these datasets, we find that their reasoning can be guided effectively through simple response prefilling -- a method we call Prefill-Guided Thinking (PGT). In particular, prefilling a VLM response with the task-aligned phrase "Let's examine the style and the synthesis artifacts" improves the Macro F1 scores of three widely used open-source VLMs by up to 24%.
♻ ☆ KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
comment: Work in progress
♻ ☆ Bit-Level Discrete Diffusion with Markov Probabilistic Models: An Improved Framework with Sharp Convergence Bounds under Minimal Assumptions
This paper introduces Discrete Markov Probabilistic Models (DMPMs), a novel discrete diffusion algorithm for discrete data generation. The algorithm operates in discrete bit space, where the noising process is a continuous-time Markov chain that flips labels uniformly at random. The time-reversal process, like the forward noise process, is a jump process with its intensity governed by a discrete analogue of the classical score function. Crucially, this intensity is proven to be the conditional expectation of a function of the forward process, underlining theoretical alignment with score-based generative models. We establish convergence bounds for the algorithm under minimal assumptions, ensuring robustness and efficiency, which we demonstrate through experiments on low-dimensional Bernoulli-distributed datasets and high-dimensional binary MNIST data. The results highlight competitive performance in generating discrete structures compared to the state-of-the-art. This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative modeling.
♻ ☆ Differential Privacy for Adaptive Weight Aggregation in Federated Tumor Segmentation
Federated Learning (FL) is a distributed machine learning approach that safeguards privacy by creating an impartial global model while respecting the privacy of individual client data. However, the conventional FL method can introduce security risks when dealing with diverse client data, potentially compromising privacy and data integrity. To address these challenges, we present a differential privacy (DP) federated deep learning framework in medical image segmentation. In this paper, we extend our similarity weight aggregation (SimAgg) method to DP-SimAgg algorithm, a differentially private similarity-weighted aggregation algorithm for brain tumor segmentation in multi-modal magnetic resonance imaging (MRI). Our DP-SimAgg method not only enhances model segmentation capabilities but also provides an additional layer of privacy preservation. Extensive benchmarking and evaluation of our framework, with computational performance as a key consideration, demonstrate that DP-SimAgg enables accurate and robust brain tumor segmentation while minimizing communication costs during model training. This advancement is crucial for preserving the privacy of medical image data and safeguarding sensitive information. In conclusion, adding a differential privacy layer in the global weight aggregation phase of the federated brain tumor segmentation provides a promising solution to privacy concerns without compromising segmentation model efficacy. By leveraging DP, we ensure the protection of client data against adversarial attacks and malicious participants.
comment: I have changed the methodology because of some technical errors in this version
♻ ☆ AMBER: Adaptive Mesh Generation by Iterative Mesh Resolution Prediction NeurIPS2025
The cost and accuracy of simulating complex physical systems using the Finite Element Method (FEM) scales with the resolution of the underlying mesh. Adaptive meshes improve computational efficiency by refining resolution in critical regions, but typically require task-specific heuristics or cumbersome manual design by a human expert. We propose Adaptive Meshing By Expert Reconstruction (AMBER), a supervised learning approach to mesh adaptation. Starting from a coarse mesh, AMBER iteratively predicts the sizing field, i.e., a function mapping from the geometry to the local element size of the target mesh, and uses this prediction to produce a new intermediate mesh using an out-of-the-box mesh generator. This process is enabled through a hierarchical graph neural network, and relies on data augmentation by automatically projecting expert labels onto AMBER-generated data during training. We evaluate AMBER on 2D and 3D datasets, including classical physics problems, mechanical components, and real-world industrial designs with human expert meshes. AMBER generalizes to unseen geometries and consistently outperforms multiple recent baselines, including ones using Graph and Convolutional Neural Networks, and Reinforcement Learning-based approaches.
comment: Accepted to NeurIPS2025
♻ ☆ Last-iterate Convergence for Symmetric, General-sum, $2 \times 2$ Games Under The Exponential Weights Dynamic
We conduct a comprehensive analysis of the discrete-time exponential-weights dynamic with a constant step size on all \emph{general-sum and symmetric} $2 \times 2$ normal-form games, i.e. games with $2$ pure strategies per player, and where the ensuing payoff tuple is of the form $(A,A^\top)$ (where $A$ is the $2 \times 2$ payoff matrix corresponding to the first player). Such symmetric games commonly arise in real-world interactions between "symmetric" agents who have identically defined utility functions -- such as Bertrand competition, multi-agent performative prediction, and certain congestion games -- and display a rich multiplicity of equilibria despite the seemingly simple setting. Somewhat surprisingly, we show through a first-principles analysis that the exponential weights dynamic, which is popular in online learning, converges in the last iterate for such games regardless of initialization with an appropriately chosen step size. For certain games and/or initializations, we further show that the convergence rate is in fact exponential and holds for any step size. We illustrate our theory with extensive simulations and applications to the aforementioned game-theoretic interactions. In the case of multi-agent performative prediction, we formulate a new "mortgage competition" game between lenders (i.e. banks) who interact with a population of customers, and show that it fits into our framework.
♻ ☆ From Injection to Defense: Constructing Edit-Based Fingerprints for Large Language Models
Fingerprinting is critical for maintaining traceability and protecting the intellectual property (IP) of developers, as LLMs deployed in web applications are susceptible to unauthorized redistribution and misuse via fine-tuning or black-box deployment. However, current backdoor-based fingerprinting methods face a fundamental trade-off: fingerprints embedded as garbled text are easily detected and filtered, whereas those crafted as coherent natural language are prone to being triggered unintentionally. To overcome these limitations, we propose RFEdit, a knowledge-editing framework that embeds a rule-based multilingual natural language fingerprint (MNLF) by modifying a sparse subset of model weights. This approach enables efficient and robust fingerprint injection with minimal impact on unrelated knowledge in LLMs. Our RFEdit framework is further safeguarded by Fingerprint Subspace-aware Fine-Tuning (FSFT), which mitigates fingerprint degradation during legitimate fine-tuning by restricting parameter updates to the fingerprint subspace. This approach preserves fingerprint integrity while enhancing downstream task performance of LLMs. These advances establish a comprehensive pipeline from fingerprint injection to defense, achieving high detection effectiveness, robustness against adversarial manipulations, harmlessness to model utility, and persistence under fine-tuning. Extensive experiments demonstrate that RFEdit maintains robustness under quantization and pruning. Additionally, fingerprint effectiveness is generally improved by more than 10\% when combined with FSFT for math and alpaca downstream tasks.
comment: preprint
♻ ☆ Autonomy-Aware Clustering: When Local Decisions Supersede Global Prescriptions
Clustering arises in a wide range of problem formulations, yet most existing approaches assume that the entities under clustering are passive and strictly conform to their assigned groups. In reality, entities often exhibit local autonomy, overriding prescribed associations in ways not fully captured by feature representations. Such autonomy can substantially reshape clustering outcomes -- altering cluster compositions, geometry, and cardinality -- with significant downstream effects on inference and decision-making. We introduce autonomy-aware clustering, a reinforcement learning (RL) framework that learns and accounts for the influence of local autonomy without requiring prior knowledge of its form. Our approach integrates RL with a Deterministic Annealing (DA) procedure, where, to determine underlying clusters, DA naturally promotes exploration in early stages of annealing and transitions to exploitation later. We also show that the annealing procedure exhibits phase transitions that enable design of efficient annealing schedules. To further enhance adaptability, we propose the Adaptive Distance Estimation Network (ADEN), a transformer-based attention model that learns dependencies between entities and cluster representatives within the RL loop, accommodates variable-sized inputs and outputs, and enables knowledge transfer across diverse problem instances. Empirical results show that our framework closely aligns with underlying data dynamics: even without explicit autonomy models, it achieves solutions close to the ground truth (gap ~3-4%), whereas ignoring autonomy leads to substantially larger gaps (~35-40%). The code and data are publicly available at https://github.com/salar96/AutonomyAwareClustering.
comment: Preprint. Under review at a peer-reviewed venue. Minor formatting correction: the earlier version included an incorrect conference header, which has been removed. Content unchanged
♻ ☆ On Task Vectors and Gradients
Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
comment: 10 pages of main paper, 5 figures
♻ ☆ AbsoluteNet: A Deep Learning Neural Network to Classify Cerebral Hemodynamic Responses of Auditory Processing
In recent years, deep learning (DL) approaches have demonstrated promising results in decoding hemodynamic responses captured by functional near-infrared spectroscopy (fNIRS), particularly in the context of brain-computer interface (BCI) applications. This work introduces AbsoluteNet, a novel deep learning architecture designed to classify auditory event-related responses recorded using fNIRS. The proposed network is built upon principles of spatio-temporal convolution and customized activation functions. Our model was compared against several models, namely fNIRSNET, MDNN, DeepConvNet, and ShallowConvNet. The results showed that AbsoluteNet outperforms existing models, reaching 87.0% accuracy, 84.8% sensitivity, and 89.2% specificity in binary classification, surpassing fNIRSNET, the second-best model, by 3.8% in accuracy. These findings underscore the effectiveness of our proposed deep learning model in decoding hemodynamic responses related to auditory processing and highlight the importance of spatio-temporal feature aggregation and customized activation functions to better fit fNIRS dynamics.
♻ ☆ FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul's $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes. Our code is available at \href{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}{\texttt{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}}.
♻ ☆ Enjoying Non-linearity in Multinomial Logistic Bandits
We consider the multinomial logistic bandit problem, a variant of where a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). They introduced a problem-dependent constant $\kappa_* \geq 1$, that may be exponentially large in some problem parameters and which is captured by the derivative of the sigmoid function. It encapsulates the non-linearity and improves existing regret guarantees over $T$ rounds from $\smash{O(d\sqrt{T})}$ to $\smash{O(d\sqrt{T/\kappa_*})}$, where $d$ is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To achieve this, we extend the definition of \( \kappa_* \) to the multinomial setting and propose an efficient algorithm that leverages the problem's non-linearity. Our method yields a problem-dependent regret bound of order $ \smash{\widetilde{\mathcal{O}}( R d \sqrt{{KT}/{\kappa_*}})} $, where $R$ is the norm of the vector of rewards and $K$ is the number of outcomes. This improves upon the best existing guarantees of order $ \smash{\widetilde{\mathcal{O}}( RdK \sqrt{T} )} $. Moreover, we provide a $\smash{ \Omega(Rd\sqrt{KT/\kappa_*})}$ lower-bound, showing that our algorithm is minimax-optimal and that our definition of $\kappa_*$ is optimal.
♻ ☆ MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning
Learning object-level, structured representations is widely regarded as a key to better generalization in vision and underpins the design of next-generation Pre-trained Vision Models (PVMs). Mainstream Object-Centric Learning (OCL) methods adopt Slot Attention or its variants to iteratively aggregate objects' super-pixels into a fixed set of query feature vectors, termed slots. However, their reliance on a static slot count leads to an object being represented as multiple parts when the number of objects varies. We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. MetaSlot (i) maintains a codebook that holds prototypes of objects in a dataset by vector-quantizing the resulting slot representations; (ii) removes duplicate slots from the traditionally aggregated slots by quantizing them with the codebook; and (iii) injects progressively weaker noise into the Slot Attention iterations to accelerate and stabilize the aggregation. MetaSlot is a general Slot Attention variant that can be seamlessly integrated into existing OCL architectures. Across multiple public datasets and tasks--including object discovery and recognition--models equipped with MetaSlot achieve significant performance gains and markedly interpretable slot representations, compared with existing Slot Attention variants.
♻ ☆ TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce--RMSNorm kernel that carefully leverages Multimem instruction support available on Hopper and Blackwell NVIDIA GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
comment: 14 pages, 16 figures. For source code, see https://github.com/microsoft/tokenweave. In version 2, Figure 6 shows All Reduce bandwidth, not Reduce Scatter. The Multimem Reduce Scatter bandwidth formula differs slightly from the Ring-based version
♻ ☆ Want to train KANS at scale? Now UKAN!
Kolmogorov-Arnold Networks (KANs) have recently emerged as a powerful alternative to traditional multilayer perceptrons. However, their reliance on predefined, bounded grids restricts their ability to approximate functions on unbounded domains. To address this, we present Unbounded Kolmogorov-Arnold Networks (UKANs), a method that removes the need for bounded grids in traditional Kolmogorov-Arnold Networks (KANs). The key innovation of this method is a coefficient-generator (CG) model that produces, on the fly, only the B-spline coefficients required locally on an unbounded symmetric grid. UKANs couple multilayer perceptrons with KANs by feeding the positional encoding of grid groups into the CG model, enabling function approximation on unbounded domains without requiring data normalization. To reduce the computational cost of both UKANs and KANs, we introduce a GPU-accelerated library that lowers B-spline evaluation complexity by a factor proportional to the grid size, enabling large-scale learning by leveraging efficient memory management, in line with recent software advances such as FlashAttention and FlashFFTConv. Performance benchmarking confirms the superior memory and computational efficiency of our accelerated KAN (warpKAN), and UKANs, showing a 3-30x speed-up and up to 1000x memory reduction compared to vanilla KANs. Experiments on regression, classification, and generative tasks demonstrate the effectiveness of UKANs to match or surpass KAN accuracy. Finally, we use both accelerated KAN and UKAN in a molecular property prediction task, establishing the feasibility of large-scale end-to-end training with our optimized implementation.
comment: 16 pages, 5 figures, 8 tables
♻ ☆ 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training
Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and used in models such as Light-R1 arXiv:2503.10460, TinyR1 arXiv:2503.04872, Kaggle AIMO math models and also in large companies' training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
comment: v2: sp for vlms; code at https://github.com/Qihoo360/360-LLaMA-Factory
♻ ☆ Data-Driven Adaptive PID Control Based on Physics-Informed Neural Networks
This article proposes a data-driven PID controller design based on the principle of adaptive gain optimization, leveraging Physics-Informed Neural Networks (PINNs) generated for predictive modeling purposes. The proposed control design method utilizes gradients of the PID gain optimization, achieved through the automatic differentiation of PINNs, to apply model predictive control using a cost function based on tracking error and control inputs. By optimizing PINNs-based PID gains, the method achieves adaptive gain tuning that ensures stability while accounting for system nonlinearities. The proposed method features a systematic framework for integrating PINNs-based models of dynamical control systems into closed-loop control systems, enabling direct application to PID control design. A series of numerical experiments is conducted to demonstrate the effectiveness of the proposed method from the control perspectives based on both time and frequency domains.
comment: This work has been submitted to the IEEE Transactions on Control Systems Technology for possible publication
♻ ☆ AdaDim: Dimensionality Adaptation for SSL Representational Dynamics
A key factor in effective Self-Supervised learning (SSL) is preventing dimensional collapse, where higher-dimensional representation spaces ($R$) span a lower-dimensional subspace. Therefore, SSL optimization strategies involve guiding a model to produce $R$ with a higher dimensionality ($H(R)$) through objectives that encourage decorrelation of features or sample uniformity in $R$. A higher $H(R)$ indicates that $R$ has greater feature diversity which is useful for generalization to downstream tasks. Alongside dimensionality optimization, SSL algorithms also utilize a projection head that maps $R$ into an embedding space $Z$. Recent work has characterized the projection head as a filter of noisy or irrelevant features from the SSL objective by reducing the mutual information $I(R;Z)$. Therefore, the current literature's view is that a good SSL representation space should have a high $H(R)$ and a low $I(R;Z)$. However, this view of SSL is lacking in terms of an understanding of the underlying training dynamics that influences the relationship between both terms. Our analysis shows that the best performing SSL models do not have the highest $H(R)$ nor the lowest $I(R;Z)$, but effectively arrive at a balance between both. To take advantage of this analysis, we introduce AdaDim, a training strategy that leverages SSL training dynamics by adaptively balancing between increasing $H(R)$ through feature decorrelation and sample uniformity as well as gradual regularization of $I(R;Z)$ as training progresses. We show performance improvements of up to 3% over common SSL baselines despite our method not utilizing expensive techniques such as queues, clustering, predictor networks, or student-teacher architectures.
comment: Under Review
♻ ☆ Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Auto-bidding serves as a critical tool for advertisers to improve their advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static offline dataset. To address this, we propose {AIGB-Pearl} (\emph{{P}lanning with {E}valu{A}tor via RL}), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator for scoring generation quality and designing a provably sound KL-Lipschitz-constrained score maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm incorporating the synchronous coupling technique is further devised to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
♻ ☆ Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in Multimodal LLMs
Hallucinations in LLMs--especially in multimodal settings--undermine reliability. We present a rigorous, information-geometric framework in diffusion dynamics that quantifies hallucination in MLLMs: model outputs are embedded spectrally on multimodal graph Laplacians, and gaps to a truth manifold define a semantic-distortion metric. We derive Courant--Fischer bounds on a temperature-dependent hallucination energy and use RKHS eigenmodes to obtain modality-aware, interpretable measures that track evolution over prompts and time. This reframes hallucination as measurable and bounded, providing a principled basis for evaluation and mitigation.
comment: 29 pages, 3 figures, 1 table
♻ ☆ RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation NeurIPS 2025
The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.
comment: Accepted at NeurIPS 2025
♻ ☆ Maximising the Utility of Validation Sets for Imbalanced Noisy-label Meta-learning
Meta-learning is an effective method to handle imbalanced and noisy-label learning, but it depends on a validation set containing randomly selected, manually labelled and balanced distributed samples. The random selection and manual labelling and balancing of this validation set is not only sub-optimal for meta-learning, but it also scales poorly with the number of classes. Hence, recent meta-learning papers have proposed ad-hoc heuristics to automatically build and label this validation set, but these heuristics are still sub-optimal for meta-learning. In this paper, we analyse the meta-learning algorithm and propose new criteria to characterise the utility of the validation set, based on: 1) the informativeness of the validation set; 2) the class distribution balance of the set; and 3) the correctness of the labels of the set. Furthermore, we propose a new imbalanced noisy-label meta-learning (INOLML) algorithm that automatically builds a validation set by maximising its utility using the criteria above. Our method shows significant improvements over previous meta-learning approaches and sets the new state-of-the-art on several benchmarks.
♻ ☆ Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models NeurIPS
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
comment: The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/
♻ ☆ Error Bounds for Physics-Informed Neural Networks in Fokker-Planck PDEs UAI
Stochastic differential equations are commonly used to describe the evolution of stochastic processes. The state uncertainty of such processes is best represented by the probability density function (PDF), whose evolution is governed by the Fokker-Planck partial differential equation (FP-PDE). However, it is generally infeasible to solve the FP-PDE in closed form. In this work, we show that physics-informed neural networks (PINNs) can be trained to approximate the solution PDF. Our main contribution is the analysis of PINN approximation error: we develop a theoretical framework to construct tight error bounds using PINNs. In addition, we derive a practical error bound that can be efficiently constructed with standard training methods. We discuss that this error-bound framework generalizes to approximate solutions of other linear PDEs. Empirical results on nonlinear, high-dimensional, and chaotic systems validate the correctness of our error bounds while demonstrating the scalability of PINNs and their significant computational speedup in obtaining accurate PDF solutions compared to the Monte Carlo approach.
comment: Accepted at Uncertainty in Artificial Intelligence (UAI) 2025
♻ ☆ DiffMI: Breaking Face Recognition Privacy via Diffusion-Driven Training-Free Model Inversion
Face recognition poses serious privacy risks due to its reliance on sensitive and immutable biometric data. While modern systems mitigate privacy risks by mapping facial images to embeddings (commonly regarded as privacy-preserving), model inversion attacks reveal that identity information can still be recovered, exposing critical vulnerabilities. However, existing attacks are often computationally expensive and lack generalization, especially those requiring target-specific training. Even training-free approaches suffer from limited identity controllability, hindering faithful reconstruction of nuanced or unseen identities. In this work, we propose DiffMI, the first diffusion-driven, training-free model inversion attack. DiffMI introduces a novel pipeline combining robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective. DiffMI applies directly to unseen target identities and face recognition models, offering greater adaptability than training-dependent approaches while significantly reducing computational overhead. Our method achieves 84.42%--92.87% attack success rates against inversion-resilient systems and outperforms the best prior training-free GAN-based approach by 4.01%--9.82%. The implementation is available at https://github.com/azrealwang/DiffMI.
♻ ☆ Learning to Recover: Dynamic Reward Shaping with Wheel-Leg Coordination for Fallen Robots
Adaptive recovery from fall incidents are essential skills for the practical deployment of wheeled-legged robots, which uniquely combine the agility of legs with the speed of wheels for rapid recovery. However, traditional methods relying on preplanned recovery motions, simplified dynamics or sparse rewards often fail to produce robust recovery policies. This paper presents a learning-based framework integrating Episode-based Dynamic Reward Shaping and curriculum learning, which dynamically balances exploration of diverse recovery maneuvers with precise posture refinement. An asymmetric actor-critic architecture accelerates training by leveraging privileged information in simulation, while noise-injected observations enhance robustness against uncertainties. We further demonstrate that synergistic wheel-leg coordination reduces joint torque consumption by 15.8% and 26.2% and improves stabilization through energy transfer mechanisms. Extensive evaluations on two distinct quadruped platforms achieve recovery success rates up to 99.1% and 97.8% without platform-specific tuning. The supplementary material is available at https://boyuandeng.github.io/L2R-WheelLegCoordination/
♻ ☆ Interpretable Robot Control via Structured Behavior Trees and Large Language Models
As intelligent robots become more integrated into human environments, there is a growing need for intuitive and reliable Human-Robot Interaction (HRI) interfaces that are adaptable and more natural to interact with. Traditional robot control methods often require users to adapt to interfaces or memorize predefined commands, limiting usability in dynamic, unstructured environments. This paper presents a novel framework that bridges natural language understanding and robotic execution by combining Large Language Models (LLMs) with Behavior Trees. This integration enables robots to interpret natural language instructions given by users and translate them into executable actions by activating domain-specific plugins. The system supports scalable and modular integration, with a primary focus on perception-based functionalities, such as person tracking and hand gesture recognition. To evaluate the system, a series of real-world experiments was conducted across diverse environments. Experimental results demonstrate that the proposed approach is practical in real-world scenarios, with an average cognition-to-execution accuracy of approximately 94%, making a significant contribution to HRI systems and robots. The complete source code of the framework is publicly available at https://github.com/snt-arg/robot_suite.
comment: 15 pages, 5 figures, 3 tables
♻ ☆ GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm
Deep neural networks are highly vulnerable to adversarial examples that inputs with small, carefully crafted perturbations that cause misclassification, making adversarial attacks an essential tool for robustness evaluation. Existing black-box attacks fall into three categories: query-only, transfer-only, and query-and-transfer, and vary in perturbation pattern and optimization strategy. However, no prior method jointly achieves query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization, leaving a gap between black-box flexibility and white-box precision. We present GreedyPixel, a new attack framework that fills this gap by combining a surrogate-derived pixel priority map with greedy, per-pixel optimization refined by query feedback. This design reduces the exponential brute-force search space to a tractable linear procedure, guarantees monotonic loss decrease and convergence to a coordinate-wise optimum, and concentrates perturbations on robust, semantically meaningful pixels to improve perceptual quality. Extensive experiments on CIFAR-10 and ImageNet under both white-box and black-box settings demonstrate that GreedyPixel achieves state-of-the-art attack success rates and produces visually imperceptible perturbations. Our results show that GreedyPixel bridges the precision gap between white-box and black-box attacks and provides a practical framework for fine-grained robustness evaluation. The implementation is available at https://github.com/azrealwang/greedypixel.
♻ ☆ A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport
Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community.
♻ ☆ Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
♻ ☆ Contrastive Graph Condensation: Advancing Data Versatility through Self-Supervised Learning KDD 2025
With the increasing computation of training graph neural networks (GNNs) on large-scale graphs, graph condensation (GC) has emerged as a promising solution to synthesize a compact, substitute graph of the large-scale original graph for efficient GNN training. However, existing GC methods predominantly employ classification as the surrogate task for optimization, thus excessively relying on node labels and constraining their utility in label-sparsity scenarios. More critically, this surrogate task tends to overfit class-specific information within the condensed graph, consequently restricting the generalization capabilities of GC for other downstream tasks. To address these challenges, we introduce Contrastive Graph Condensation (CTGC), which adopts a self-supervised surrogate task to extract critical, causal information from the original graph and enhance the cross-task generalizability of the condensed graph. Specifically, CTGC employs a dual-branch framework to disentangle the generation of the node attributes and graph structures, where a dedicated structural branch is designed to explicitly encode geometric information through nodes' positional embeddings. By implementing an alternating optimization scheme with contrastive loss terms, CTGC promotes the mutual enhancement of both branches and facilitates high-quality graph generation through the model inversion technique. Extensive experiments demonstrate that CTGC excels in handling various downstream tasks with a limited number of labels, consistently outperforming state-of-the-art GC methods.
comment: KDD 2025
♻ ☆ Domain Generalization by Rejecting Extreme Augmentations WACV 2024
Data augmentation is one of the most effective techniques for regularizing deep learning models and improving their recognition performance in a variety of tasks and domains. However, this holds for standard in-domain settings, in which the training and test data follow the same distribution. For the out-of-domain case, where the test data follow a different and unknown distribution, the best recipe for data augmentation is unclear. In this paper, we show that for out-of-domain and domain generalization settings, data augmentation can provide a conspicuous and robust improvement in performance. To do that, we propose a simple training procedure: (i) use uniform sampling on standard data augmentation transformations; (ii) increase the strength transformations to account for the higher data variance expected when working out-of-domain, and (iii) devise a new reward function to reject extreme transformations that can harm the training. With this procedure, our data augmentation scheme achieves a level of accuracy that is comparable to or better than state-of-the-art methods on benchmark domain generalization datasets. Code: https://github.com/Masseeh/DCAug
comment: WACV 2024: Winter Conference on Applications of Computer Vision 2024
♻ ☆ Improving Neutral Point-of-View Generation with Data- and Parameter-Efficient RL
The paper shows that parameter-efficient reinforcement learning (PE-RL) is a highly effective training regime to improve large language models' (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e. to provide significantly more informative, diverse and impartial answers. This is shown by evaluating PE-RL and multiple strong baselines-including LoRA finetuning (strongest baseline), SFT and RLHF. PE-RL not only improves on overall NPOV quality compared to the strongest baseline ($97.06\%\rightarrow 99.08\%$), but also scores much higher on features linguists identify as key to separating sufficient answers from "great'' answers ($60.25\%\rightarrow 85.21\%$ for presence of supportive details, $68.74\%\rightarrow 91.43\%$ for absence of oversimplification). A qualitative analysis corroborates this. Moreover, our evaluation also finds a key property of PE-RL for this task: unlike methods that update all parameters, it generalises out of topic. Finally, to enable further studies we also release the dataset, SHQ-NPOV, and provide a methodology to create such datasets through iterative rounds of human peer-critique and annotator training.
♻ ☆ Real-Time Progress Prediction in Reasoning Language Models
Recent advances in reasoning language models -- particularly those that use long, latent chains of thought -- have demonstrated remarkable capabilities in complex, agentic tasks. However, as these models operate over increasingly extended time horizons, their internal progress becomes opaque to users, complicating expectation management and real-time oversight. In this work, we investigate whether real-time progress prediction is feasible. We discretize progress and train a linear probe to classify reasoning states. We then introduce a two-stage fine-tuning approach that enables reasoning models to generate progress estimates (0$\rightarrow$100\%) during inference. Our best fine-tuned model achieves an average error of 10\% for sequences less than 16,000 tokens, offering a practical mechanism for monitoring and interpreting model reasoning in real time.
♻ ☆ Train-Free Segmentation in MRI with Cubical Persistent Homology
We present a new general framework for segmentation of MRI scans based on Topological Data Analysis (TDA), offering several advantages over traditional machine learning approaches. The pipeline proceeds in three steps, first identifying the whole object to segment via automatic thresholding, then detecting a distinctive subset whose topology is known in advance, and finally deducing the various components of the segmentation. Unlike most prior TDA uses in medical image segmentation, which are typically embedded within deep networks, our approach is a standalone method tailored to MRI. A key ingredient is the localization of representative cycles from the persistence diagram, which enables interpretable mappings from topological features to anatomical components. In particular, the method offers the ability to perform segmentation without the need for large annotated datasets. Its modular design makes it adaptable to a wide range of data segmentation challenges. We validate the framework on three applications: glioblastoma segmentation in brain MRI, where a sphere is to be detected; myocardium in cardiac MRI, forming a cylinder; and cortical plate detection in fetal brain MRI, whose 2D slices are circles. We compare our method with established supervised and unsupervised baselines.
comment: Preprint, 36 pages, 18 figures, 4 tables. For associated code, see https://github.com/antonfrancois/gliomaSegmentation_TDA
♻ ☆ Automating RT Planning at Scale: High Quality Data For AI Training
Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances with artificial intelligence (AI) promise to improve its precision and efficiency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution for generating high-quality treatment plans. This scalable solution is designed to generate substantial volumes of consistently high-quality treatment plans, overcoming a key obstacle in the advancement of AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software like Varian Eclipse. Furthermore, a novel approach for determining optimization parameters to reproduce 3D dose distributions, i.e. a method to convert dose predictions to deliverable treatment plans constrained by machine limitations is proposed. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To our best knowledge, this dataset features more than 10 times number of plans compared to the largest existing well-curated public dataset. Repo: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge.
comment: radiotherapy planning, data for AI training
♻ ☆ Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs
Deep learning (DL) algorithms are often defined in terms of temporal relationships: a tensor at one timestep may depend on tensors from earlier or later timesteps. Such dynamic dependencies (and corresponding dynamic tensor shapes) are difficult to express and optimize: while eager DL systems support such dynamism, they cannot apply compiler-based optimizations; graph-based systems require static tensor shapes, which forces users to pad tensors or break-up programs into multiple static graphs. We describe Tempo, a new DL system that combines the dynamism of eager execution with the whole-program optimizations of graph-based compilation. Tempo achieves this through a declarative programming model with recurrent tensors, which include explicit temporal dimensions. Temporal dimensions can be indexed using symbolic expressions to express dynamic dependencies on past and future tensors. Based on this, Tempo constructs a symbolic dependence graph, which concisely encodes dynamic dependencies between operators, and applies whole-program optimizations, such as algebraic simplifications, vectorization, tiling, and fusion. By tiling dynamic dependencies into static-size blocks, Tempo can also reuse existing static code-generators. It then uses a polyhedral model to find a feasible execution schedule, which includes memory management operations. We show that Tempo achieves a 7$\times$ speedup over JAX for Llama-3.2-3B decoding; for reinforcement learning algorithms, Tempo achieves a 54$\times$ speedup, with 16$\times$ lower peak memory usage.
comment: 17 pages, 24 figures, 3 bibliography pages
♻ ☆ Who Pays for Fairness? Rethinking Recourse under Social Burden
Machine learning based predictions are increasingly used in sensitive decision-making applications that directly affect our lives. This has led to extensive research into ensuring the fairness of classifiers. Beyond just fair classification, emerging legislation now mandates that when a classifier delivers a negative decision, it must also offer actionable steps an individual can take to reverse that outcome. This concept is known as algorithmic recourse. Nevertheless, many researchers have expressed concerns about the fairness guarantees within the recourse process itself. In this work, we provide a holistic theoretical characterization of unfairness in algorithmic recourse, formally linking fairness guarantees in recourse and classification, and highlighting limitations of the standard equal cost paradigm. We then introduce a novel fairness framework based on social burden, along with a practical algorithm (MISOB), broadly applicable under real-world conditions. Empirical results on real-world datasets show that MISOB reduces the social burden across all groups without compromising overall classifier accuracy.
♻ ☆ GenFacts-Generative Counterfactual Explanations for Multi-Variate Time Series
Counterfactual explanations aim to enhance model transparency by illustrating how input modifications can change model predictions. In the multivariate time series domain, existing approaches often produce counterfactuals that lack validity, plausibility, or intuitive interpretability. We present \textbf{GenFacts}, a novel generative framework for producing plausible and actionable counterfactual explanations for time series classifiers. GenFacts introduces a structured approach to latent space modeling and targeted counterfactual synthesis. We evaluate GenFacts on radar gesture recognition as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts consistently outperforms baseline methods in plausibility metrics (+18.7\%) and achieves the highest interpretability scores in user studies. These results underscore that realism and user-centered interpretability, rather than sparsity alone, are vital for actionable counterfactuals in time series applications.
comment: 4 pages
♻ ☆ NAR-*ICP: Neural Execution of Classical ICP-based Pointcloud Registration Algorithms
This study explores the intersection of neural networks and classical robotics algorithms through the Neural Algorithmic Reasoning (NAR) blueprint, enabling the training of neural networks to reason like classical robotics algorithms by learning to execute them. Algorithms are integral to robotics and safety-critical applications due to their predictable and consistent performance through logical and mathematical principles. In contrast, while neural networks are highly adaptable, handling complex, high-dimensional data and generalising across tasks, they often lack interpretability and transparency in their internal computations. To bridge the two, we propose a novel Graph Neural Network (GNN)-based framework, NAR-*ICP, that learns the intermediate computations of classical ICP-based registration algorithms, extending the CLRS Benchmark. We evaluate our approach across real-world and synthetic datasets, demonstrating its flexibility in handling complex inputs, and its potential to be used within larger learning pipelines. Our method achieves superior performance compared to the baselines, even surpassing the algorithms it was trained on, further demonstrating its ability to generalise beyond the capabilities of traditional algorithms.
comment: 19 pages, 16 tables, 7 figures
♻ ☆ Inference-Time Scaling of Discrete Diffusion Models via Importance Weighting and Optimal Proposal Design
Discrete diffusion models have become highly effective across various domains. However, real-world applications often require the generative process to adhere to certain constraints. To this end, we propose a Sequential Monte Carlo (SMC) framework that enables scalable inference-time control of discrete diffusion models through principled importance weighting and optimal proposal construction. Specifically, our approach derives tractable importance weights for a range of intermediate targets and characterises the optimal proposal, for which we develop two practical approximations: a first-order gradient-based approximation and an amortised proposal trained to minimise the log-variance of the importance weights. Empirical results across synthetic tasks, language modelling, biology design, and text-to-image generation demonstrate that our framework enhances controllability and sample quality, highlighting the effectiveness of SMC as a versatile recipe for scaling discrete diffusion models at inference time.
♻ ☆ Flow Matching for Robust Simulation-Based Inference under Model Misspecification
Simulation-based inference (SBI) is transforming experimental sciences by enabling parameter estimation in complex non-linear models from simulated data. A persistent challenge, however, is model misspecification: simulators are only approximations of reality, and mismatches between simulated and real data can yield biased or overconfident posteriors. We address this issue by introducing Flow Matching Corrected Posterior Estimation (FMCPE), a framework that leverages the flow matching paradigm to refine simulation-trained posterior estimators using a small set of real calibration samples. Our approach proceeds in two stages: first, a posterior approximator is trained on abundant simulated data; second, flow matching transports its predictions toward the true posterior supported by real observations, without requiring explicit knowledge of the misspecification. This design enables FMCPE to combine the scalability of SBI with robustness to distributional shift. Across synthetic benchmarks and real-world datasets, we show that our proposal consistently mitigates the effects of misspecification, delivering improved inference accuracy and uncertainty calibration compared to standard SBI baselines, while remaining computationally efficient.
♻ ☆ Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.
♻ ☆ P3D: Scalable Neural Surrogates for High-Resolution 3D Physics Simulations with Global Context
We present a scalable framework for learning deterministic and probabilistic neural surrogates for high-resolution 3D physics simulations. We introduce a hybrid CNN-Transformer backbone architecture targeted for 3D physics simulations, which significantly outperforms existing architectures in terms of speed and accuracy. Our proposed network can be pretrained on small patches of the simulation domain, which can be fused to obtain a global solution, optionally guided via a fast and scalable sequence-to-sequence model to include long-range dependencies. This setup allows for training large-scale models with reduced memory and compute requirements for high-resolution datasets. We evaluate our backbone architecture against a large set of baseline methods with the objective to simultaneously learn the dynamics of 14 different types of PDEs in 3D. We demonstrate how to scale our model to high-resolution isotropic turbulence with spatial resolutions of up to $512^3$. Finally, we demonstrate the versatility of our network by training it as a diffusion model to produce probabilistic samples of highly turbulent 3D channel flows across varying Reynolds numbers, accurately capturing the underlying flow statistics.
♻ ☆ GRPO is Secretly a Process Reward Model ICLR 2026
We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks$-$and reach peak performance more rapidly$-$than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
comment: 14 pages, 6 figures; under review at ICLR 2026
♻ ☆ When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity
LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth based benchmarks. We argue that without tight objectives and verifiable constructions, benchmark rankings can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We released our code and dataset at https://github.com/penfever/judgment-to-noise
♻ ☆ Quantum Rationale-Aware Graph Contrastive Learning for Jet Discrimination
In high-energy physics, particle jet tagging plays a pivotal role in distinguishing quark from gluon jets using data from collider experiments. While graph-based deep learning methods have advanced this task beyond traditional feature-engineered approaches, the complex data structure and limited labeled samples present ongoing challenges. However, existing contrastive learning (CL) frameworks struggle to leverage rationale-aware augmentations effectively, often lacking supervision signals that guide the extraction of salient features and facing computational efficiency issues such as high parameter counts. In this study, we demonstrate that integrating a quantum rationale generator (QRG) within our proposed Quantum Rationale-aware Graph Contrastive Learning (QRGCL) framework significantly enhances jet discrimination performance, reducing reliance on labeled data and capturing discriminative features. Evaluated on the quark-gluon jet dataset, QRGCL achieves an AUC score of $77.53\%$ while maintaining a compact architecture of only 45 QRG parameters, outperforming classical, quantum, and hybrid GCL and GNN benchmarks. These results highlight QRGCL's potential to advance jet tagging and other complex classification tasks in high-energy physics, where computational efficiency and feature extraction limitations persist.
♻ ☆ FedAGHN: Personalized Federated Learning with Attentive Graph HyperNetworks
Personalized Federated Learning (PFL) aims to address the statistical heterogeneity of data across clients by learning the personalized model for each client. Among various PFL approaches, the personalized aggregation-based approach conducts parameter aggregation in the server-side aggregation phase to generate personalized models, and focuses on learning appropriate collaborative relationships among clients for aggregation. However, the collaborative relationships vary in different scenarios and even at different stages of the FL process. To this end, we propose Personalized Federated Learning with Attentive Graph HyperNetworks (FedAGHN), which employs Attentive Graph HyperNetworks (AGHNs) to dynamically capture fine-grained collaborative relationships and generate client-specific personalized initial models. Specifically, AGHNs empower graphs to explicitly model the client-specific collaborative relationships, construct collaboration graphs, and introduce tunable attentive mechanism to derive the collaboration weights, so that the personalized initial models can be obtained by aggregating parameters over the collaboration graphs. Extensive experiments can demonstrate the superiority of FedAGHN. Moreover, a series of visualizations are presented to explore the effectiveness of collaboration graphs learned by FedAGHN.
comment: Final accepted manuscript of the article published in Knowledge-Based Systems, 2025
♻ ☆ Unlocking Dataset Distillation with Diffusion Models
Dataset distillation seeks to condense datasets into smaller but highly representative synthetic samples. While diffusion models now lead all generative benchmarks, current distillation methods avoid them and rely instead on GANs or autoencoders, or, at best, sampling from a fixed diffusion prior. This trend arises because naive backpropagation through the long denoising chain leads to vanishing gradients, which prevents effective synthetic sample optimization. To address this limitation, we introduce Latent Dataset Distillation with Diffusion Models (LD3M), the first method to learn gradient-based distilled latents and class embeddings end-to-end through a pre-trained latent diffusion model. A linearly decaying skip connection, injected from the initial noisy state into every reverse step, preserves the gradient signal across dozens of timesteps without requiring diffusion weight fine-tuning. Across multiple ImageNet subsets at 128x128 and 256x256, LD3M improves downstream accuracy by up to 4.8 percentage points (1 IPC) and 4.2 points (10 IPC) over the prior state-of-the-art. The code for LD3M is provided at https://github.com/Brian-Moser/prune_and_distill.
♻ ☆ ExLLM: Experience-Enhanced LLM Optimization for Molecular Design and Beyond
Molecular design involves an enormous and irregular search space, where traditional optimizers such as Bayesian optimization, genetic algorithms, and generative models struggle to leverage expert knowledge or handle complex feedback. Recently, LLMs have been used as optimizers, achieving promising results on benchmarks such as PMO. However, existing approaches rely only on prompting or extra training, without mechanisms to handle complex feedback or maintain scalable memory. In particular, the common practice of appending or summarizing experiences at every query leads to redundancy, degraded exploration, and ultimately poor final outcomes under large-scale iterative search. We introduce ExLLM (Experience-Enhanced LLM optimization), an LLM-as-optimizer framework with three components: (1) a compact, evolving experience snippet tailored to large discrete spaces that distills non-redundant cues and improves convergence at low cost; (2) a simple yet effective k-offspring scheme that widens exploration per call and reduces orchestration cost; and (3) a lightweight feedback adapter that normalizes objectives for selection while formatting constraints and expert hints for iteration. ExLLM sets new state-of-the-art results on PMO and generalizes strongly in our setup, it sets records on circle packing and stellarator design, and yields consistent gains across additional domains requiring only a task-description template and evaluation functions to transfer.
comment: 10 pages, under review
♻ ☆ Estimating the Joint Probability of Scenario Parameters with Gaussian Mixture Copula Models
This paper presents the first application of Gaussian Mixture Copula Models to the statistical modeling of driving scenarios for the safety validation of automated driving systems. Knowledge of the joint probability distribution of scenario parameters is essential for scenario-based safety assessment, where risk quantification depends on the likelihood of concrete parameter combinations. Gaussian Mixture Copula Models bring together the multimodal expressivity of Gaussian Mixture Models and the flexibility of copulas, enabling separate modeling of marginal distributions and dependencies. We benchmark Gaussian Mixture Copula Models against previously proposed approaches - Gaussian Mixture Models and Gaussian Copula Models - using real-world driving data drawn from scenarios defined in United Nations Regulation No. 157. Our evaluation across approximately 18 million scenario instances demonstrates that Gaussian Mixture Copula Models consistently surpass Gaussian Copula Models and perform better than, or at least comparably to, Gaussian Mixture Models, as measured by both log-likelihood and Sinkhorn distance. These results are promising for the adoption of Gaussian Mixture Copula Models as a statistical foundation for future scenario-based validation frameworks.
comment: 8 pages, 4 figures; This work has been submitted to the IEEE for possible publication
♻ ☆ VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction NIPS
In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON's adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9% compared to DPOT and 44.7% compared to MPP, while requiring only 72.5% and 34.8% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in imperfect measurement systems where sampling frequencies may differ or frames might be dropped - common challenges in real-world settings - without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41% relative performance degradation compared to 71.37%-74.49% degradation in baseline methods, demonstrating its versatility for deploying in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.
comment: update after NIPS suggestions
♻ ☆ Evil twins are not that evil: Qualitative insights into machine-generated prompts
It has been widely observed that language models (LMs) respond in predictable ways to algorithmically generated prompts that are seemingly unintelligible. This is both a sign that we lack a full understanding of how LMs work, and a practical challenge, because opaqueness can be exploited for harmful uses of LMs, such as jailbreaking. We present the first thorough analysis of opaque machine-generated prompts, or autoprompts, pertaining to 6 LMs of different sizes and families. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. A small but consistent proportion of the previous tokens are prunable, probably appearing in the prompt as a by-product of the fact that the optimization process fixes the number of tokens. The remaining tokens fall into two categories: filler tokens, which can be replaced with semantically unrelated substitutes, and keywords, that tend to have at least a loose semantic relation with the generation, although they do not engage in well-formed syntactic relations with it. Additionally, human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque. Finally, some of the ablations we applied to autoprompts yield similar effects in natural language inputs, suggesting that autoprompts emerge naturally from the way LMs process linguistic inputs in general.
comment: Published as workshop paper at BlackBox NLP 2025
♻ ☆ Token-based Audio Inpainting via Discrete Diffusion
Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across range of gap lengths, for gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training. Audio examples of our proposed method can be found at https://iftach21.github.io/.
♻ ☆ Is Supervised Learning Really That Different from Unsupervised?
We demonstrate how supervised learning can be decomposed into a two-stage procedure, where (1) all model parameters are selected in an unsupervised manner, and (2) the outputs y are added to the model, without changing the parameter values. This is achieved by a new model selection criterion that, in contrast to cross-validation, can be used also without access to y. For linear ridge regression, we bound the asymptotic out-of-sample risk of our method in terms of the optimal asymptotic risk. We also demonstrate on real and synthetic data that versions of linear and kernel ridge regression, smoothing splines, and neural networks, which are trained without access to y, perform similarly to their standard y-based counterparts. Hence, our results suggest that the difference between supervised and unsupervised learning is less fundamental than it may appear.
♻ ☆ Achieving Hyperbolic-Like Expressiveness with Arbitrary Euclidean Regions: A New Approach to Hierarchical Embeddings
Hierarchical data is common in many domains like life sciences and e-commerce, and its embeddings often play a critical role. While hyperbolic embeddings offer a theoretically grounded approach to representing hierarchies in low-dimensional spaces, current methods often rely on specific geometric constructs as embedding candidates. This reliance limits their generalizability and makes it difficult to integrate with techniques that model semantic relationships beyond pure hierarchies, such as ontology embeddings. In this paper, we present RegD, a flexible Euclidean framework that supports the use of arbitrary geometric regions -- such as boxes and balls -- as embedding representations. Although RegD operates entirely in Euclidean space, we formally prove that it achieves hyperbolic-like expressiveness by incorporating a depth-based dissimilarity between regions, enabling it to emulate key properties of hyperbolic geometry, including exponential growth. Our empirical evaluation on diverse real-world datasets shows consistent performance gains over state-of-the-art methods and demonstrates RegD's potential for broader applications such as the ontology embedding task that goes beyond hierarchy.
♻ ☆ GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning
Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPSMasked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. Code is released for further reference.
comment: 4 pages, 2 figures
♻ ☆ 2 OLMo 2 Furious
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from T\"ulu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini.
comment: Shorter version accepted to COLM 2025. Updated to include 32B results. Model demo available at playground.allenai.org
♻ ☆ Distributional Machine Unlearning via Selective Data Removal
Machine learning systems increasingly face requirements to remove entire domains of information -- such as toxic language or biases -- rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain's statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for exponential families and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15-82% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.
♻ ☆ An Empirical Analysis of the Laplace and Neural Tangent Kernels
The neural tangent kernel is a kernel function defined over the parameter distribution of an infinite width neural network. Despite the impracticality of this limit, the neural tangent kernel has allowed for a more direct study of neural networks and a gaze through the veil of their black box. More recently, it has been shown theoretically that the Laplace kernel and neural tangent kernel share the same reproducing kernel Hilbert space in the space of $\mathbb{S}^{d-1}$ alluding to their equivalence. In this work, we analyze the practical equivalence of the two kernels. We first do so by matching the kernels exactly and then by matching posteriors of a Gaussian process. Moreover, we analyze the kernels in $\mathbb{R}^d$ and experiment with them in the task of regression.
comment: 105 pages, 23 figures, published in California State University's ScholarWorks repository as part of master's degree completion
♻ ☆ Jailbreak Attack Initializations as Extractors of Compliance Directions
Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks utilize arbitrary or hand-picked initializations. This work presents that each gradient-based jailbreak attack and subsequent initialization gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.
comment: Accepted to Findings of the Association for Computational Linguistics 2025 (EMNLP 2025)
♻ ☆ GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification
Integrating Pre-trained Language Models (PLMs) with Graph Neural Networks (GNNs) remains a central challenge in text-rich heterophilic graph learning. We propose a novel integration framework that enables effective fusion between powerful pre-trained text encoders and Relational Graph Convolutional Networks (R-GCNs). Our method enhances the alignment of textual and structural representations through a bidirectional fusion mechanism and contrastive node-level optimization. To evaluate the approach, we train two variants using different PLMs: Snowflake-Embed (state-of-the-art) and GTE-base, each paired with an R-GCN backbone. Experiments on five heterophilic benchmarks demonstrate that our integration method achieves state-of-the-art results on four datasets, surpassing existing GNN and large language model-based approaches. Notably, Snowflake-Embed + R-GCN improves accuracy on the Texas dataset by over 8\% and on Wisconsin by nearly 5\%. These results highlight the effectiveness of our fusion strategy for advancing text-rich graph representation learning.
♻ ☆ A Dual-Agent Adversarial Framework for Robust Generalization in Deep Reinforcement Learning
Recently, empowered with the powerful capabilities of neural networks, reinforcement learning (RL) has successfully tackled numerous challenging tasks. However, while these models demonstrate enhanced decision-making abilities, they are increasingly prone to overfitting. For instance, a trained RL model often fails to generalize to even minor variations of the same task, such as a change in background color or other minor semantic differences. To address this issue, we propose a dual-agent adversarial policy learning framework, which allows agents to spontaneously learn the underlying semantics without introducing any human prior knowledge. Specifically, our framework involves a game process between two agents: each agent seeks to maximize the impact of perturbing on the opponent's policy by producing representation differences for the same state, while maintaining its own stability against such perturbations. This interaction encourages agents to learn generalizable policies, capable of handling irrelevant features from the high-dimensional observations. Extensive experimental results on the Procgen benchmark demonstrate that the adversarial process significantly improves the generalization performance of both agents, while also being applied to various RL algorithms, e.g., Proximal Policy Optimization (PPO). With the adversarial framework, the RL agent outperforms the baseline methods by a significant margin, especially in hard-level tasks, marking a significant step forward in the generalization capabilities of deep reinforcement learning.
♻ ☆ IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs
Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR\, -- \,a quality-constrained \textbf{I}ntelligent \textbf{P}rompt \textbf{R}outing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench\footnote{IPRBench will be released upon legal approval.}, a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9\% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency. The deployed system and additional product details are publicly available at https://aws.amazon.com/bedrock/intelligent-prompt-routing/
♻ ☆ Sustainable LSTM-Based Precoding for RIS-Aided mmWave MIMO Systems with Implicit CSI
In this paper, we propose a sustainable long short-term memory (LSTM)-based precoding framework for reconfigurable intelligent surface (RIS)-assisted millimeter-wave (mmWave) MIMO systems. Instead of explicit channel state information (CSI) estimation, the framework exploits uplink pilot sequences to implicitly learn channel characteristics, reducing both pilot overhead and inference complexity. Practical hardware constraints are addressed by incorporating the phase-dependent amplitude model of RIS elements, while a multi-label training strategy improves robustness when multiple near-optimal codewords yield comparable performance. Simulations show that the proposed design achieves over 90% of the spectral efficiency of exhaustive search (ES) with only 2.2% of its computation time, cutting energy consumption by nearly two orders of magnitude. The method also demonstrates resilience under distribution mismatch and scalability to larger RIS arrays, making it a practical and energy-efficient solution for sustainable 6G wireless networks.
comment: 6 pages, 5 figures, 2 tables, and accepted by 2025 IEEE Globecom Workshops
♻ ☆ VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning NeurIPS 2025
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
comment: Accepted by NeurIPS 2025
♻ ☆ CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
♻ ☆ Enhancing GraphQL Security by Detecting Malicious Queries Using Large Language Models, Sentence Transformers, and Convolutional Neural Networks
GraphQL's flexibility, while beneficial for efficient data fetching, introduces unique security vulnerabilities that traditional API security mechanisms often fail to address. Malicious GraphQL queries can exploit the language's dynamic nature, leading to denial-of-service attacks, data exfiltration through injection, and other exploits. Existing solutions, such as static analysis, rate limiting, and general-purpose Web Application Firewalls, offer limited protection against sophisticated, context-aware attacks. This paper presents a novel, AI-driven approach for real-time detection of malicious GraphQL queries. Our method combines static analysis with machine learning techniques, including Large Language Models (LLMs) for dynamic schema-based configuration, Sentence Transformers (SBERT and Doc2Vec) for contextual embedding of query payloads, and Convolutional Neural Networks (CNNs), Random Forests, and Multilayer Perceptrons for classification. We detail the system architecture, implementation strategies optimized for production environments (including ONNX Runtime optimization and parallel processing), and evaluate the performance of our detection models and the overall system under load. Results demonstrate high accuracy in detecting various threats, including SQL injection, OS command injection, and XSS exploits, alongside effective mitigation of DoS and SSRF attempts. This research contributes a robust and adaptable solution for enhancing GraphQL API security.
♻ ☆ Robot Learning from Any Images
We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources, including camera captures, robotic datasets, and Internet images. At its core, our approach combines a novel method for single-view physical scene recovery with an efficient visual blending strategy for photorealistic data collection. We demonstrate RoLA's versatility across applications like scalable robotic data generation and augmentation, robot learning from Internet images, and single-image real-to-sim-to-real systems for manipulators and humanoids. Video results are available at https://sihengz02.github.io/RoLA .
comment: CoRL 2025 camera ready
♻ ☆ ProCut: LLM Prompt Compression via Attribution Estimation
In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.
♻ ☆ Membership Inference Attacks on LLM-based Recommender Systems
Large language models (LLMs) based Recommender Systems (RecSys) can flexibly adapt recommendation systems to different domains. It utilizes in-context learning (ICL), i.e., the prompts, to customize the recommendation functions, which include sensitive historical user-specific item interactions, e.g., implicit feedback like clicked items or explicit product reviews. Such private information may be exposed to novel privacy attack. However, no study has been done on this important issue. We design four membership inference attacks (MIAs), aiming to reveal whether victims' historical interactions have been used by system prompts. They are \emph{direct inquiry, hallucination, similarity, and poisoning attacks}, each of which utilizes the unique features of LLMs or RecSys. We have carefully evaluated them on three LLMs that have been used to develop ICL-LLM RecSys and two well-known RecSys benchmark datasets. The results confirm that the MIA threat on LLM RecSys is realistic: direct inquiry and poisoning attacks showing significantly high attack advantages. We have also analyzed the factors affecting these attacks, such as the number of shots in system prompts and the position of the victim in the shots.
comment: this paper is under review
♻ ☆ Unveiling the Basin-Like Loss Landscape in Large Language Models
We discover the emergence of \textit{basins} in the loss landscape of large language models. As model scale increases, LLMs become progressively more resilient to random perturbations in the parameter space, giving rise to expansive stability regions where models exhibit nearly identical performance, but outside of which their capabilities collapse. We observe that pre-training creates a \textit{basic capability} basin, and subsequent alignment fine-tuning forms \textit{specific capability} basins (e.g., safety, math, coding). Thus, we argue that benign fine-tuning confined to the basin should preserve prior capabilities. Besides, we also analyze the loss landscape for worst-case directions, which is consistently sharp and detrimental. We find that adversarial fine-tuning moves along the nearly worst-case directions, thus rapidly degrading model capabilities. Finally, we provide a theoretical analysis demonstrating that the basin size bounds the performance degradation of any fine-tuning, including the adversarial ones, while also guaranteeing the model robustness w.r.t. input perturbations, suggesting the benefit of enlarging basins.
♻ ☆ HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models NeurIPS 2025
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Our code is available at https://github.com/hrlics/HoPE.
comment: NeurIPS 2025
♻ ☆ Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points in average on math reasoning benchmarks. It also achieves up to 4.93\texttimes{} fewer rollouts and an up to 4.19\texttimes{} reduction in wall-clock time to match GRPO's best accuracy.
♻ ☆ Guiding Mixture-of-Experts with Temporal Multimodal Interactions
Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. Our framework builds on a new formulation of temporal multimodal interaction dynamics, which are used to guide expert routing. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.
comment: 21 pages, 8 figures, 10 tables
♻ ☆ Towards the Worst-case Robustness of Large Language Models
Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, indicating that most current deterministic defenses achieve nearly 0\% worst-case robustness. We propose a general tight lower bound for randomized smoothing using fractional knapsack solvers or 0-1 knapsack solvers, and using them to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing using a uniform kernel, against \textit{any possible attack} with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.
♻ ☆ Simulation-based inference via telescoping ratio estimation for trawl processes
The growing availability of large and complex datasets has increased interest in temporal stochastic processes that can capture stylized facts such as marginal skewness, non-Gaussian tails, long memory, and even non-Markovian dynamics. While such models are often easy to simulate from, parameter estimation remains challenging. Simulation-based inference (SBI) offers a promising way forward, but existing methods typically require large training datasets or complex architectures and frequently yield confidence (credible) regions that fail to attain their nominal values, raising doubts on the reliability of estimates for the very features that motivate the use of these models. To address these challenges, we propose a fast and accurate, sample-efficient SBI framework for amortized posterior inference applicable to intractable stochastic processes. The proposed approach relies on two main steps: first, we learn the posterior density by decomposing it sequentially across parameter dimensions. Then, we use Chebyshev polynomial approximations to efficiently generate independent posterior samples, enabling accurate inference even when Markov chain Monte Carlo methods mix poorly. We further develop novel diagnostic tools for SBI in this context, as well as post-hoc calibration techniques; the latter not only lead to performance improvements of the learned inferential tool, but also to the ability to reuse it directly with new time series of varying lengths, thus amortizing the training cost. We demonstrate the method's effectiveness on trawl processes, a class of flexible infinitely divisible models that generalize univariate Gaussian processes, applied to energy demand data.
♻ ☆ DPGIIL: Dirichlet Process-Deep Generative Model-Integrated Incremental Learning for Clustering in Transmissibility-based Online Structural Anomaly Detection
Clustering based on vibration responses, such as transmissibility functions (TFs), is promising in structural anomaly detection. However, most existing methods struggle to determine the optimal cluster number, handle high-dimensional streaming data, and rely heavily on manually engineered features due to their shallow structures. To address these issues, this work proposes a novel clustering framework, referred to as Dirichlet process-deep generative model-integrated incremental learning (DPGIIL), for online structural anomaly detection, which combines the advantages of deep generative models (DGMs) in representation learning and the Dirichlet process mixture model (DPMM) in identifying distinct patterns in observed data. Within the context of variational Bayesian inference, a lower bound on the log marginal likelihood of DPGIIL, tighter than the evidence lower bound, is derived analytically, which enables the joint optimization of DGM and DPMM parameters, thereby allowing the DPMM to regularize the DGM's feature extraction process. Additionally, a greedy split-merge scheme-based coordinate ascent variational inference method is devised to accelerate the optimization. The summary statistics of the DPMM, along with the network parameters, are used to retain information about previous data for incremental learning. For online structural anomaly detection, DPGIIL can not only detect anomalies by dynamically assigning incoming data to new clusters but also indicate different structural states using distinct clusters, thereby providing additional information about the operating conditions of the monitored structure compared to traditional anomaly detectors. Three case studies demonstrate the dynamic adaptability of the proposed method and show that it outperforms some state-of-the-art approaches in both structural anomaly detection and clustering.
comment: 53 pages,9 figures,7 tables, accepted by Advanced Engineering Informatics
♻ ☆ A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning
Effective feature selection is vital for robust and interpretable medical prediction, especially for identifying risk factors concentrated in extreme patient strata. Standard methods emphasize average associations and may miss predictors whose importance lies in the tails of the distribution. We propose a computationally efficient supervised filter that ranks features using the Gumbel copula upper tail dependence coefficient ($\lambda_U$), prioritizing variables that are simultaneously extreme with the positive class. We benchmarked against Mutual Information, mRMR, ReliefF, and $L_1$ Elastic Net across four classifiers on two diabetes datasets: a large public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Evaluation included paired statistical tests, permutation importance, and robustness checks with label flips, feature noise, and missingness. On CDC, our method was the fastest selector and reduced the feature space by about 52% while retaining strong discrimination. Although using all 21 features yielded the highest AUC, our filter significantly outperformed Mutual Information and mRMR and was statistically indistinguishable from ReliefF. On PIMA, with only eight predictors, our ranking produced the numerically highest ROC AUC, and no significant differences were found versus strong baselines. Across both datasets, the upper tail criterion consistently identified clinically coherent, impactful predictors. We conclude that copula based feature selection via upper tail dependence is a powerful, efficient, and interpretable approach for building risk models in public health and clinical medicine.
comment: Submitted
♻ ☆ Reinforcement Learning for Dynamic Memory Allocation
In recent years, reinforcement learning (RL) has gained popularity and has been applied to a wide range of tasks. One such popular domain where RL has been effective is resource management problems in systems. We look to extend work on RL for resource management problems by considering the novel domain of dynamic memory allocation management. We consider dynamic memory allocation to be a suitable domain for RL since current algorithms like first-fit, best-fit, and worst-fit can fail to adapt to changing conditions and can lead to fragmentation and suboptimal efficiency. In this paper, we present a framework in which an RL agent continuously learns from interactions with the system to improve memory management tactics. We evaluate our approach through various experiments using high-level and low-level action spaces and examine different memory allocation patterns. Our results show that RL can successfully train agents that can match and surpass traditional allocation strategies, particularly in environments characterized by adversarial request patterns. We also explore the potential of history-aware policies that leverage previous allocation requests to enhance the allocator's ability to handle complex request patterns. Overall, we find that RL offers a promising avenue for developing more adaptive and efficient memory allocation strategies, potentially overcoming limitations of hardcoded allocation algorithms.
♻ ☆ PolyKAN: A Polyhedral Analysis Framework for Provable and Approximately Optimal KAN Compression
Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to traditional Multi-Layer Perceptrons (MLPs), offering enhanced interpretability and a solid mathematical foundation. However, their parameter efficiency remains a significant challenge for practical deployment. This paper introduces PolyKAN, a novel theoretical framework for KAN compression that provides formal guarantees on both model size reduction and approximation error. By leveraging the inherent piecewise polynomial structure of KANs, we formulate the compression problem as a polyhedral region merging task. We establish a rigorous polyhedral characterization of KANs, develop a complete theory of $\epsilon$-equivalent compression, and design a dynamic programming algorithm that achieves approximately optimal compression under specified error bounds. Our theoretical analysis demonstrates that PolyKAN achieves provably near-optimal compression while maintaining strict error control, with guaranteed global optimality for univariate spline functions. This framework provides the first formal foundation for KAN compression with mathematical guarantees, opening new directions for the efficient deployment of interpretable neural architectures.
comment: The description of the paper's contributions has been tightened up, and statements that may cause misunderstandings have been removed
♻ ☆ Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.
♻ ☆ ACT-Tensor: Tensor Completion Framework for Financial Dataset Imputation
Missing data in financial panels presents a critical obstacle, undermining asset-pricing models and reducing the effectiveness of investment strategies. Such panels are often inherently multi-dimensional, spanning firms, time, and financial variables, which adds complexity to the imputation task. Conventional imputation methods often fail by flattening the data's multidimensional structure, struggling with heterogeneous missingness patterns, or overfitting in the face of extreme data sparsity. To address these limitations, we introduce an Adaptive, Cluster-based Temporal smoothing tensor completion framework (ACT-Tensor) tailored for severely and heterogeneously missing multi-dimensional financial data panels. ACT-Tensor incorporates two key innovations: a cluster-based completion module that captures cross-sectional heterogeneity by learning group-specific latent structures; and a temporal smoothing module that proactively removes short-lived noise while preserving slow-moving fundamental trends. Extensive experiments show that ACT-Tensor consistently outperforms state-of-the-art benchmarks in terms of imputation accuracy across a range of missing data regimes, including extreme sparsity scenarios. To assess its practical financial utility, we evaluate the imputed data with an asset-pricing pipeline tailored for tensor-structured financial data. Results show that ACT-Tensor not only reduces pricing errors but also significantly improves risk-adjusted returns of the constructed portfolio. These findings confirm that our method delivers highly accurate and informative imputations, offering substantial value for financial decision-making.
comment: The 6th ACM International Conference on AI in Finance (ICAIF 2025)
♻ ☆ Nonparametric Bellman Mappings for Value Iteration in Distributed Reinforcement Learning
This paper introduces novel Bellman mappings (B-Maps) for value iteration (VI) in distributed reinforcement learning (DRL), where agents are deployed over an undirected, connected graph/network with arbitrary topology -- but without a centralized node, that is, a node capable of aggregating all data and performing computations. Each agent constructs a nonparametric B-Map from its private data, operating on Q-functions represented in a reproducing kernel Hilbert space, with flexibility in choosing the basis for their representation. Agents exchange their Q-function estimates only with direct neighbors, and unlike existing DRL approaches that restrict communication to Q-functions, the proposed framework also enables the transmission of basis information in the form of covariance matrices, thereby conveying additional structural details. Linear convergence rates are established for both Q-function and covariance-matrix estimates toward their consensus values, regardless of the network topology, with optimal learning rates determined by the ratio of the smallest positive eigenvalue (the graph's Fiedler value) to the largest eigenvalue of the graph Laplacian matrix. A detailed performance analysis further shows that the proposed DRL framework effectively approximates the performance of a centralized node, had such a node existed. Numerical tests on two benchmark control problems confirm the effectiveness of the proposed nonparametric B-Maps relative to prior methods. Notably, the tests reveal a counter-intuitive outcome: although the framework involves richer information exchange -- specifically through transmitting covariance matrices as basis information -- it achieves the desired performance at a lower cumulative communication cost than existing DRL schemes, underscoring the critical role of sharing basis information in accelerating the learning process.
♻ ☆ Learning where to learn: Training data distribution optimization for scientific machine learning
In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution that minimizes average prediction error across a family of deployment regimes. A theoretical analysis shows how the training distribution shapes deployment accuracy. This motivates two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures. Discretized implementations using parametric distribution classes or nonparametric particle-based gradient flows deliver optimized training distributions that outperform nonadaptive designs. Once trained, the resulting models exhibit improved sample complexity and robustness to distribution shift. This framework unlocks the potential of principled data acquisition for learning functions and solution operators of partial differential equations.
comment: 48 pages, 18 figures, 3 tables, 3 algorithms
♻ ☆ MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
Multimodal large language models with Retrieval Augmented Generation (RAG) have significantly advanced tasks such as multimodal question answering by grounding responses in external text and images. This grounding improves factuality, reduces hallucination, and extends reasoning beyond parametric knowledge. However, this reliance on external knowledge poses a critical yet underexplored safety risk: knowledge poisoning attacks, where adversaries deliberately inject adversarial multimodal content into external knowledge bases to steer model toward generating incorrect or even harmful responses. To expose such vulnerabilities, we propose MM-PoisonRAG, the first framework to systematically design knowledge poisoning in multimodal RAG. We introduce two complementary attack strategies: Localized Poisoning Attack (LPA), which implants targeted multimodal misinformation to manipulate specific queries, and Globalized Poisoning Attack (GPA), which inserts a single adversarial knowledge to broadly disrupt reasoning and induce nonsensical responses across all queries. Comprehensive experiments across tasks, models, and access settings show that LPA achieves targeted manipulation with attack success rates of up to 56%, while GPA completely disrupts model generation to 0% accuracy with just a single adversarial knowledge injection. Our results reveal the fragility of multimodal RAG and highlight the urgent need for defenses against knowledge poisoning.
comment: Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG
♻ ☆ A Minimalist Bayesian Framework for Stochastic Optimization
The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the component of interest, such as the location of the optimum. Nuisance parameters are eliminated via profile likelihood, which naturally handles constraints. As a direct instantiation, we develop a MINimalist Thompson Sampling (MINTS) algorithm. Our framework accommodates structured problems, including continuum-armed Lipschitz bandits and dynamic pricing. It also provides a probabilistic lens on classical convex optimization algorithms such as the center of gravity and ellipsoid methods. We further analyze MINTS for multi-armed bandits and establish near-optimal regret guarantees.
comment: 27 pages, 2 figures
♻ ☆ An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
In this paper, we introduce a systematic framework beyond conventional method to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals but drops by 4.7 percentage points on surface-renaming variants, and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
comment: 34 pages, 9 figures
♻ ☆ Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias
For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $ \\{ \lVert \widehat{w_p} \rVert_r \\}_{r \in [1,p]} $ with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal *spike* and a *bulk* of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$'s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of *all* $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat {w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $l_r$ norm is used.
♻ ☆ Platonic Transformers: A Solid Choice For Equivariance
While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.
♻ ☆ Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout
We propose Orthogonal Monte Carlo Dropout, a mechanism that enforces strict orthogonality when combining sparse semantic vectors without extra time complexity. Low-Rank Adaptation (LoRA), a popular fine-tuning method for large models, typically trains a module to represent a specific concept such as an object or a style. When multiple LoRA modules are merged, for example to generate an object in a particular style, their outputs (semantic vectors) may interfere with each other. Our method guarantees that merged LoRA modules remain orthogonal and thus free from direct interference. However, empirical analysis reveals that such orthogonality does not lead to the semantic disentanglement highlighted in prior work on compositional adaptation. This finding suggests that inter-LoRA orthogonality alone may be insufficient for achieving true semantic compositionality, prompting a re-examination of its role in adapter merging.
♻ ☆ Enjoying Non-linearity in Multinomial Logistic Bandits
We consider the multinomial logistic bandit problem, a variant of where a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). They introduced a problem-dependent constant $\kappa_* \geq 1$, that may be exponentially large in some problem parameters and which is captured by the derivative of the sigmoid function. It encapsulates the non-linearity and improves existing regret guarantees over $T$ rounds from $\smash{O(d\sqrt{T})}$ to $\smash{O(d\sqrt{T/\kappa_*})}$, where $d$ is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To achieve this, we extend the definition of $\kappa_*$ to the multinomial setting and propose an efficient algorithm that leverages the problem's non-linearity. Our method yields a problem-dependent regret bound of order $ \smash{\widetilde{\mathcal{O}}( R d \sqrt{{KT}/{\kappa_*}})} $, where $R$ is the norm of the vector of rewards and $K$ is the number of outcomes. This improves upon the best existing guarantees of order $ \smash{\widetilde{\mathcal{O}}( RdK \sqrt{T} )} $. Moreover, we provide a $\smash{ \Omega(Rd\sqrt{KT/\kappa_*})}$ lower-bound, showing that our algorithm is minimax-optimal and that our definition of $\kappa_*$ is optimal.
♻ ☆ Rex: Reversible Solvers for Diffusion Models
Diffusion models have quickly become the state-of-the-art for numerous generation tasks across many different applications. Encoding samples from the data distribution back into the models underlying prior distribution is an important task that arises in many downstream applications. This task is often called the inversion of diffusion models. Prior approaches for solving this task, however, are often simple heuristic solvers that come with several drawbacks in practice. In this work, we propose a new family of solvers for diffusion models by exploiting the connection between this task and the broader study of algebraically reversible solvers for differential equations. In particular, we construct a family of reversible solvers using an application of Lawson methods to construct exponential Runge-Kutta methods for the diffusion models. We call this family of reversible exponential solvers Rex. In addition to a rigorous theoretical analysis of the proposed solvers we also emonstrate the utility of the methods through a variety of empirical illustrations.
comment: Updated preprint. More results, experiments, and proofs
♻ ☆ Objective Features Extracted from Motor Activity Time Series for Food Addiction Analysis Using Machine Learning - A Pilot Study
Wearable sensors and IoT/IoMT platforms enable continuous, real-time monitoring, but objective digital markers for eating disorders are limited. In this study, we examined whether actimetry and machine learning (ML) could provide objective criteria for food addiction (FA) and symptom counts (SC). In 78 participants (mean age 22.1 +/- 9.5 y; 73.1% women), one week of non-dominant wrist actimetry and psychometric data (YFAS, DEBQ, ZSDS) were collected. The time series were segmented into daytime activity and nighttime rest, and statistical and entropy descriptors (FuzzyEn, DistEn, SVDEn, PermEn, PhaseEn; 256 features) were calculated. The mean Matthews correlation coefficient (MCC) was used as the primary metric in a K-nearest neighbors (KNN) pipeline with five-fold stratified cross-validation (one hundred repetitions; 500 evaluations); SHAP was used to assist in interpretation. For binary FA, activity-segment features performed best (MCC = 0.78 +/- 0.02; Accuracy ~ 95.3% +/- 0.5; Sensitivity ~ 0.77 +/- 0.03; Specificity ~ 0.98 +/- 0.004), exceeding OaS (Objective and Subjective Features) (MCC = 0.69 +/- 0.03) and rest-only (MCC = 0.50 +/- 0.03). For SC (four classes), OaS slightly surpassed actimetry (MCC = 0.40 +/- 0.01 vs 0.38 +/- 0.01; Accuracy ~ 58.1% vs 56.9%). Emotional and restrained eating were correlated with actimetric features. These findings support wrist-worn actimetry as a digital biomarker of FA that complements questionnaires and may facilitate privacy-preserving clinical translation.
comment: 25 pages, 4 figures, 14 tables
♻ ☆ OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8-hour trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
comment: Project website: https://omniretarget.github.io
♻ ☆ Contrastive Difference Predictive Coding ICLR 2024
Predicting and reasoning about the future lie at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time series data, learning representations that encode long-term dependencies usually requires large amounts of data. In this paper, we introduce a temporal difference version of contrastive predictive coding that stitches together pieces of different time series data to decrease the amount of data required to learn predictions of future events. We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL. Experiments demonstrate that, compared with prior RL methods, ours achieves $2 \times$ median improvement in success rates and can better cope with stochastic environments. In tabular settings, we show that our method is about $20 \times$ more sample efficient than the successor representation and $1500 \times$ more sample efficient than the standard (Monte Carlo) version of contrastive predictive coding.
comment: ICLR 2024. Website (https://chongyi-zheng.github.io/td_infonce) and code (https://github.com/chongyi-zheng/td_infonce)
♻ ☆ Kernel-Free Universum Quadratic Surface Twin Support Vector Machines for Imbalanced Data
Binary classification tasks with imbalanced classes pose significant challenges in machine learning. Traditional classifiers often struggle to accurately capture the characteristics of the minority class, resulting in biased models with subpar predictive performance. In this paper, we introduce a novel approach to tackle this issue by leveraging Universum points to support the minority class within quadratic twin support vector machine models. Unlike traditional classifiers, our models utilize quadratic surfaces instead of hyperplanes for binary classification, providing greater flexibility in modeling complex decision boundaries. By incorporating Universum points, our approach enhances classification accuracy and generalization performance on imbalanced datasets. We generated four artificial datasets to demonstrate the flexibility of the proposed methods. Additionally, we validated the effectiveness of our approach through empirical evaluations on benchmark datasets, showing superior performance compared to conventional classifiers and existing methods for imbalanced classification.
♻ ☆ Can Large Reasoning Models Self-Train?
Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training - the process where a model learns from its own judgments - can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. On a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of such a self-training paradigm - prolonged RL with self-reward leads to reward hacking where models learn to maximize training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms to enable prolonged self-improvement.
comment: Project website: https://self-rewarding-llm-training.github.io/
♻ ☆ Learning to Partially Defer for Sequences
In the Learning to Defer (L2D) framework, a prediction model can either make a prediction or defer it to an expert, as determined by a rejector. Current L2D methods train the rejector to decide whether to reject the {\em entire prediction}, which is not desirable when the model predicts long sequences. We present an L2D setting for sequence outputs where the system can defer \textit{specific outputs} of the whole model prediction to an expert in an effort to interleave the expert and machine throughout the prediction. We propose two types of model-based post-hoc rejectors for pre-trained predictors: a token-level rejector, which defers specific token predictions to experts with next token prediction capabilities, and a one-time rejector for experts without such abilities, which defers the remaining sequence from a specific point onward. In the experiments, we also empirically demonstrate that such granular deferrals achieve better cost-accuracy tradeoffs than whole deferrals on Traveling salesman solvers, News summarization, and Weather prediction.
♻ ☆ Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining
The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., outweighing language prior biases over visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. However, existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
♻ ☆ $\textit{Agents Under Siege}$: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Most discussions about Large Language Model (LLM) safety have focused on single-agent settings but multi-agent LLM systems now create novel adversarial risks because their behavior depends on communication between agents and decentralized reasoning. In this work, we innovatively focus on attacking pragmatic systems that have constrains such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a $\textit{permutation-invariant adversarial attack}$ that optimizes prompt distribution across latency and bandwidth-constraint network topologies to bypass distributed safety mechanisms within the system. Formulating the attack path as a problem of $\textit{maximum-flow minimum-cost}$, coupled with the novel $\textit{Permutation-Invariant Evasion Loss (PIEL)}$, we leverage graph-based optimization to maximize attack success rate while minimizing detection risk. Evaluating across models including $\texttt{Llama}$, $\texttt{Mistral}$, $\texttt{Gemma}$, $\texttt{DeepSeek}$ and other variants on various datasets like $\texttt{JailBreakBench}$ and $\texttt{AdversarialBench}$, our method outperforms conventional attacks by up to $7\times$, exposing critical vulnerabilities in multi-agent systems. Moreover, we demonstrate that existing defenses, including variants of $\texttt{Llama-Guard}$ and $\texttt{PromptGuard}$, fail to prohibit our attack, emphasizing the urgent need for multi-agent specific safety mechanisms.
♻ ☆ MeanSparse: Post-Training Robustness Enhancement Through Mean-Centered Feature Sparsification
We present a simple yet effective method to improve the robustness of both Convolutional and attention-based Neural Networks against adversarial examples by post-processing an adversarially trained model. Our technique, MeanSparse, cascades the activation functions of a trained model with novel operators that sparsify mean-centered feature vectors. This is equivalent to reducing feature variations around the mean, and we show that such reduced variations merely affect the model's utility, yet they strongly attenuate the adversarial perturbations and decrease the attacker's success rate. Our experiments show that, when applied to the top models in the RobustBench leaderboard, MeanSparse achieves a new robustness record of 75.28% (from 73.71%), 44.78% (from 42.67%) and 62.12% (from 59.56%) on CIFAR-10, CIFAR-100 and ImageNet, respectively, in terms of AutoAttack accuracy. Code is available at https://github.com/SPIN-UMass/MeanSparse
♻ ☆ Objective Features Extracted from Motor Activity Time Series for Food Addiction Analysis Using Machine Learning -- A Pilot Study
Wearable sensors and IoT/IoMT platforms enable continuous, real-time monitoring, but objective digital markers for eating disorders are limited. In this study, we examined whether actimetry and machine learning (ML) could provide objective criteria for food addiction (FA) and symptom counts (SC). In 78 participants (mean age 22.1 +/- 9.5 y; 73.1% women), one week of non-dominant wrist actimetry and psychometric data (YFAS, DEBQ, ZSDS) were collected. The time series were segmented into daytime activity and nighttime rest, and statistical and entropy descriptors (FuzzyEn, DistEn, SVDEn, PermEn, PhaseEn; 256 features) were calculated. The mean Matthews correlation coefficient (MCC) was used as the primary metric in a K-nearest neighbors (KNN) pipeline with five-fold stratified cross-validation (one hundred repetitions; 500 evaluations); SHAP was used to assist in interpretation. For binary FA, activity-segment features performed best (MCC = 0.78 +/- 0.02; Accuracy ~ 95.3% +/- 0.5; Sensitivity ~ 0.77 +/- 0.03; Specificity ~ 0.98 +/- 0.004), exceeding OaS (Objective and Subjective Features) (MCC = 0.69 +/- 0.03) and rest-only (MCC = 0.50 +/- 0.03). For SC (four classes), OaS slightly surpassed actimetry (MCC = 0.40 +/- 0.01 vs 0.38 +/- 0.01; Accuracy ~ 58.1% vs 56.9%). Emotional and restrained eating were correlated with actimetric features. These findings support wrist-worn actimetry as a digital biomarker of FA that complements questionnaires and may facilitate privacy-preserving clinical translation.
comment: 25 pages, 4 figures, 14 tables
Computer Vision and Pattern Recognition 170
☆ Human3R: Everyone Everywhere All at Once
We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R, and uses parameter-efficient visual prompt tuning, to strive to preserve CUT3R's rich spatiotemporal priors, while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline, be easily extended for downstream applications.Code available in https://fanegg.github.io/Human3R
comment: Page: https://fanegg.github.io/Human3R Code: https://github.com/fanegg/Human3R
☆ EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.
☆ Dropping the D: RGB-D SLAM Without the Depth Sensor
We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
☆ Fine-grained Defocus Blur Control for Generative Image Models
Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata, or EXIF data, which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backwards through this whole process, allowing us to learn without explicit supervision to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise interactive user control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
comment: Project link: https://www.ayshrv.com/defocus-blur-gen
☆ Drive&Gen: Co-Evaluating End-to-End Driving and Video Generation Models
Recent advances in generative models have sparked exciting new possibilities in the field of autonomous vehicles. Specifically, video generation models are now being explored as controllable virtual testing environments. Simultaneously, end-to-end (E2E) driving models have emerged as a streamlined alternative to conventional modular autonomous driving systems, gaining popularity for their simplicity and scalability. However, the application of these techniques to simulation and planning raises important questions. First, while video generation models can generate increasingly realistic videos, can these videos faithfully adhere to the specified conditions and be realistic enough for E2E autonomous planner evaluation? Second, given that data is crucial for understanding and controlling E2E planners, how can we gain deeper insights into their biases and improve their ability to generalize to out-of-distribution scenarios? In this work, we bridge the gap between the driving models and generative world models (Drive&Gen) to address these questions. We propose novel statistical measures leveraging E2E drivers to evaluate the realism of generated videos. By exploiting the controllability of the video generation model, we conduct targeted experiments to investigate distribution gaps affecting E2E planner performance. Finally, we show that synthetic data produced by the video generation model offers a cost-effective alternative to real-world data collection. This synthetic data effectively improves E2E model generalization beyond existing Operational Design Domains, facilitating the expansion of autonomous vehicle services into new operational contexts.
comment: Accepted by IROS 2025
☆ ShapeGen4D: Towards High Quality 4D Shape Generation from Videos
Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) a time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
comment: Project page: https://shapegen4d.github.io/
☆ Overlap-aware segmentation for topological reconstruction of obscured objects
The separation of overlapping objects presents a significant challenge in scientific imaging. While deep learning segmentation-regression algorithms can predict pixel-wise intensities, they typically treat all regions equally rather than prioritizing overlap regions where attribution is most ambiguous. Recent advances in instance segmentation show that weighting regions of pixel overlap in training can improve segmentation boundary predictions in regions of overlap, but this idea has not yet been extended to segmentation regression. We address this with Overlap-Aware Segmentation of ImageS (OASIS): a new segmentation-regression framework with a weighted loss function designed to prioritize regions of object-overlap during training, enabling extraction of pixel intensities and topological features from heavily obscured objects. We demonstrate OASIS in the context of the MIGDAL experiment, which aims to directly image the Migdal effect--a rare process where electron emission is induced by nuclear scattering--in a low-pressure optical time projection chamber. This setting poses an extreme test case, as the target for reconstruction is a faint electron recoil track which is often heavily-buried within the orders-of-magnitude brighter nuclear recoil track. Compared to unweighted training, OASIS improves median intensity reconstruction errors from -32% to -14% for low-energy electron tracks (4-5 keV) and improves topological intersection-over-union scores from 0.828 to 0.855. These performance gains demonstrate OASIS's ability to recover obscured signals in overlap-dominated regions. The framework provides a generalizable methodology for scientific imaging where pixels represent physical quantities and overlap obscures features of interest. All code is openly available to facilitate cross-domain adoption.
☆ Smartphone-based iris recognition through high-quality visible-spectrum iris image capture.V2
Smartphone-based iris recognition in the visible spectrum (VIS) remains difficult due to illumination variability, pigmentation differences, and the absence of standardized capture controls. This work presents a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance at acquisition and demonstrates that accurate VIS iris recognition is feasible on commodity devices. Using a custom Android application performing real-time framing, sharpness evaluation, and feedback, we introduce the CUVIRIS dataset of 752 compliant images from 47 subjects. A lightweight MobileNetV3-based multi-task segmentation network (LightIrisNet) is developed for efficient on-device processing, and a transformer matcher (IrisFormer) is adapted to the VIS domain. Under a standardized protocol and comparative benchmarking against prior CNN baselines, OSIRIS attains a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS. The acquisition app, trained models, and a public subset of the dataset are released to support reproducibility. These results confirm that standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.
comment: We build upon our earlier work, arXiv:2412.13063
☆ Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images
We tackle the problem of forecasting bimanual 3D hand motion & articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality in hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and effectiveness of our lifting (42% better) & forecasting (16.4% gain) models, over the best baselines, especially in zero-shot generalization to everyday images.
comment: Project page: https://ap229997.github.io/projects/forehand4d
☆ Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and continuously segment them through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic `locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g, point), and struggles to maintain temporal consistency as the segmenting process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models, fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventional generating from noise to mask or directly predicting mask, we reformulate the task by learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks. Specifically, achieving a $\mathcal{J}\&\mathcal{F}$ of 51.1 in MeViS (+1.6 over prior SOTA) and 73.3 in the zero shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
☆ Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (plus6.43 percent BLEU-1, plus18.57 percent BLEU-2, plus31.58 percent BLEU-3, plus4.80 percent METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
comment: 16 pages,6 figures
☆ Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework BMVC2025
Deep learning in medical imaging is often limited by scarce and imbalanced annotated data. We present SSGNet, a unified framework that combines class specific generative modeling with iterative semisupervised pseudo labeling to enhance both classification and segmentation. Rather than functioning as a standalone model, SSGNet augments existing baselines by expanding training data with StyleGAN3 generated images and refining labels through iterative pseudo labeling. Experiments across multiple medical imaging benchmarks demonstrate consistent gains in classification and segmentation performance, while Frechet Inception Distance analysis confirms the high quality of generated samples. These results highlight SSGNet as a practical strategy to mitigate annotation bottlenecks and improve robustness in medical image analysis.
comment: Accepted at BMVC2025
☆ Multimodal Feature Prototype Learning for Interpretable and Discriminative Cancer Survival Prediction
Survival analysis plays a vital role in making clinical decisions. However, the models currently in use are often difficult to interpret, which reduces their usefulness in clinical settings. Prototype learning presents a potential solution, yet traditional methods focus on local similarities and static matching, neglecting the broader tumor context and lacking strong semantic alignment with genomic data. To overcome these issues, we introduce an innovative prototype-based multimodal framework, FeatProto, aimed at enhancing cancer survival prediction by addressing significant limitations in current prototype learning methodologies within pathology. Our framework establishes a unified feature prototype space that integrates both global and local features of whole slide images (WSI) with genomic profiles. This integration facilitates traceable and interpretable decision-making processes. Our approach includes three main innovations: (1) A robust phenotype representation that merges critical patches with global context, harmonized with genomic data to minimize local bias. (2) An Exponential Prototype Update Strategy (EMA ProtoUp) that sustains stable cross-modal associations and employs a wandering mechanism to adapt prototypes flexibly to tumor heterogeneity. (3) A hierarchical prototype matching scheme designed to capture global centrality, local typicality, and cohort-level trends, thereby refining prototype inference. Comprehensive evaluations on four publicly available cancer datasets indicate that our method surpasses current leading unimodal and multimodal survival prediction techniques in both accuracy and interoperability, providing a new perspective on prototype learning for critical medical applications. Our source code is available at https://github.com/JSLiam94/FeatProto.
comment: 12 pages, 10 figures
☆ Compact Multi-level-prior Tensor Representation for Hyperspectral Image Super-resolution
Fusing a hyperspectral image with a multispectral image acquired over the same scene, \textit{i.e.}, hyperspectral image super-resolution, has become a popular computational way to access the latent high-spatial-spectral-resolution image. To date, a variety of fusion methods have been proposed, among which the tensor-based ones have testified that multiple priors, such as multidimensional low-rankness and spatial total variation at multiple levels, effectively drive the fusion process. However, existing tensor-based models can only effectively leverage one or two priors at one or two levels, since simultaneously incorporating multi-level priors inevitably increases model complexity. This introduces challenges in both balancing the weights of different priors and optimizing multi-block structures. Concerning this, we present a novel hyperspectral super-resolution model compactly characterizing these multi-level priors of hyperspectral images within the tensor framework. Firstly, the proposed model decouples the spectral low-rankness and spatial priors by casting the latent high-spatial-spectral-resolution image into spectral subspace and spatial maps via block term decomposition. Secondly, these spatial maps are stacked as the spatial tensor encoding the high-order spatial low-rankness and smoothness priors, which are co-modeled via the proposed non-convex mode-shuffled tensor correlated total variation. Finally, we draw inspiration from the linearized alternating direction method of multipliers to design an efficient algorithm to optimize the resulting model, theoretically proving its Karush-Kuhn-Tucker convergence under mild conditions. Experiments on multiple datasets demonstrate the effectiveness of the proposed algorithm. The code implementation will be available from https://github.com/WongYinJ.
☆ A public cardiac CT dataset featuring the left atrial appendage
Despite the success of advanced segmentation frameworks such as TotalSegmentator (TS), accurate segmentations of the left atrial appendage (LAA), coronary arteries (CAs), and pulmonary veins (PVs) remain a significant challenge in medical imaging. In this work, we present the first open-source, anatomically coherent dataset of curated, high-resolution segmentations for these structures, supplemented with whole-heart labels produced by TS on the publicly available ImageCAS dataset consisting of 1000 cardiac computed tomography angiography (CCTA) scans. One purpose of the data set is to foster novel approaches to the analysis of LAA morphology. LAA segmentations on ImageCAS were generated using a state-of-the-art segmentation framework developed specifically for high resolution LAA segmentation. We trained the network on a large private dataset with manual annotations provided by medical readers guided by a trained cardiologist and transferred the model to ImageCAS data. CA labels were improved from the original ImageCAS annotations, while PV segmentations were refined from TS outputs. In addition, we provide a list of scans from ImageCAS that contains common data flaws such as step artefacts, LAAs extending beyond the scanner's field of view, and other types of data defects.
comment: 8 pages, 5 figures, published at STACOM2025
☆ When Thinking Drifts: Evidential Grounding for Robust Video Reasoning NeurIPS 2025
Video reasoning, the task of enabling machines to infer from dynamic visual content through multi-step logic, is crucial for advanced AI. While the Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks, its application to video understanding remains underexplored. This paper presents a systematic analysis revealing that CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues, and leading to hallucinated visual details and overridden correct intuitions - a phenomenon we term "visual thinking drift". We explain this drift through a Bayesian lens, positing that CoT traces often diverge from actual visual evidence, instead amplifying internal biases or language priors, causing models to storytell rather than engage in grounded reasoning. To counteract this, we introduce Visual Evidence Reward (VER), a novel reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence. Comprehensive evaluation across 10 diverse video understanding benchmarks demonstrates that our Video-VER consistently achieves top performance. Our work sheds light on the distinct challenges of video-centric reasoning and encourages the development of AI that robustly grounds its inferences in visual evidence - for large multimodal models that not only "think before answering", but also "see while thinking".
comment: Accepted by NeurIPS 2025, Project page: https://vision.cs.utexas.edu/projects/video-ver/
☆ There is More to Attention: Statistical Filtering Enhances Explanations in Vision Transformers
Explainable AI (XAI) has become increasingly important with the rise of large transformer models, yet many explanation methods designed for CNNs transfer poorly to Vision Transformers (ViTs). Existing ViT explanations often rely on attention weights, which tend to yield noisy maps as they capture token-to-token interactions within each layer.While attribution methods incorporating MLP blocks have been proposed, we argue that attention remains a valuable and interpretable signal when properly filtered. We propose a method that combines attention maps with a statistical filtering, initially proposed for CNNs, to remove noisy or uninformative patterns and produce more faithful explanations. We further extend our approach with a class-specific variant that yields discriminative explanations. Evaluation against popular state-of-the-art methods demonstrates that our approach produces sharper and more interpretable maps. In addition to perturbation-based faithfulness metrics, we incorporate human gaze data to assess alignment with human perception, arguing that human interpretability remains essential for XAI. Across multiple datasets, our approach consistently outperforms or is comparable to the SOTA methods while remaining efficient and human plausible.
☆ Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA
CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the models inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.
comment: 14pages, 11figures
☆ Medical Vision Language Models as Policies for Robotic Surgery
Vision-based Proximal Policy Optimization (PPO) struggles with visual observation-based robotic laparoscopic surgical tasks due to the high-dimensional nature of visual input, the sparsity of rewards in surgical environments, and the difficulty of extracting task-relevant features from raw visual data. We introduce a simple approach integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. Our method is evaluated on five diverse laparoscopic surgery task environments in LapGym, using only endoscopic visual observations. MedFlamingo PPO outperforms and converges faster compared to both standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates exceeding 70% across all environments, with improvements ranging from 66.67% to 1114.29% compared to baseline. By processing task observations and instructions once per episode to generate high-level planning tokens, our method efficiently combines medical expertise with real-time visual feedback. Our results highlight the value of specialized medical knowledge in robotic surgical planning and decision-making.
comment: IEEE CAI 2025
☆ Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation, addressing this unexplored gap. Specifically, we propose a diffusion model by introducing a set of powerful conditioning signals derived from the full 360-degree space: a panoramic saliency map to identify regions of interest, a bounding-box-aware signed distance map to define the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially-aware viewpoint videos and audios that are coherently influenced by the broader, unseen environmental context, introducing a strong controllability that is essential for realistic and immersive audio-visual generation. We show audiovisual examples proving the effectiveness of our framework.
☆ GLVD: Guided Learned Vertex Descent
Existing 3D face modeling methods usually depend on 3D Morphable Models, which inherently constrain the representation capacity to fixed shape priors. Optimization-based approaches offer high-quality reconstructions but tend to be computationally expensive. In this work, we introduce GLVD, a hybrid method for 3D face reconstruction from few-shot images that extends Learned Vertex Descent (LVD) by integrating per-vertex neural field optimization with global structural guidance from dynamically predicted 3D keypoints. By incorporating relative spatial encoding, GLVD iteratively refines mesh vertices without requiring dense 3D supervision. This enables expressive and adaptable geometry reconstruction while maintaining computational efficiency. GLVD achieves state-of-the-art performance in single-view settings and remains highly competitive in multi-view scenarios, all while substantially reducing inference time.
☆ VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization ICCV 2025
Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization in reinforcement learning method that guides the exploration of the VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, obtaining accuracy and efficiency gains. The code is publicly available at https://github.com/caoxinye/VideoMiner.
comment: Accepted by ICCV 2025
☆ Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between
We introduce Universal Neural Architecture Space (UniNAS), a generic search space for neural architecture search (NAS) which unifies convolutional networks, transformers, and their hybrid architectures under a single, flexible framework. Our approach enables discovery of novel architectures as well as analyzing existing architectures in a common framework. We also propose a new search algorithm that allows traversing the proposed search space, and demonstrate that the space contains interesting architectures, which, when using identical training setup, outperform state-of-the-art hand-crafted architectures. Finally, a unified toolkit including a standardized training and evaluation protocol is introduced to foster reproducibility and enable fair comparison in NAS research. Overall, this work opens a pathway towards systematically exploring the full spectrum of neural architectures with a unified graph-based NAS perspective.
☆ Emergent AI Surveillance: Overlearned Person Re-Identification and Its Mitigation in Law Enforcement Context
Generic instance search models can dramatically reduce the manual effort required to analyze vast surveillance footage during criminal investigations by retrieving specific objects of interest to law enforcement. However, our research reveals an unintended emergent capability: through overlearning, these models can single out specific individuals even when trained on datasets without human subjects. This capability raises concerns regarding identification and profiling of individuals based on their personal data, while there is currently no clear standard on how de-identification can be achieved. We evaluate two technical safeguards to curtail a model's person re-identification capacity: index exclusion and confusion loss. Our experiments demonstrate that combining these approaches can reduce person re-identification accuracy to below 2% while maintaining 82% of retrieval performance for non-person objects. However, we identify critical vulnerabilities in these mitigations, including potential circumvention using partial person images. These findings highlight urgent regulatory questions at the intersection of AI governance and data protection: How should we classify and regulate systems with emergent identification capabilities? And what technical standards should be required to prevent identification capabilities from developing in seemingly benign applications?
comment: 10 pages, accepted to AIES 2025
☆ Continual Learning for Image Captioning through Improved Image-Text Alignment
Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embedding; and (3) a language-guided contrastive loss that employs a triplet loss to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting, while achieving better semantic caption alignment compared to state-of-the-art methods. The code can be found via the following link https://github.com/ Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
comment: 11 pages, 3 figures
☆ Detection and Measurement of Hailstones with Multimodal Large Language Models
This study examines the use of social media and news images to detect and measure hailstones, utilizing pre-trained multimodal large language models. The dataset for this study comprises 474 crowdsourced images of hailstones from documented hail events in Austria, which occurred between January 2022 and September 2024. These hailstones have maximum diameters ranging from 2 to 11cm. We estimate the hail diameters and compare four different models utilizing one-stage and two-stage prompting strategies. The latter utilizes additional size cues from reference objects, such as human hands, within the image. Our results show that pretrained models already have the potential to measure hailstone diameters from images with an average mean absolute error of 1.12cm for the best model. In comparison to a single-stage prompt, two-stage prompting improves the reliability of most models. Our study suggests that these off-the-shelf models, even without fine-tuning, can complement traditional hail sensors by extracting meaningful and spatially dense information from social media imagery, enabling faster and more detailed assessments of severe weather events. The automated real-time image harvesting from social media and other sources remains an open task, but it will make our approach directly applicable to future hail events.
comment: 6 pages, 5 figures, accepted at The 2nd International Conference on Electrical and Computer Engineering Researches
Diffusion-Based Image Editing for Breaking Robust Watermarks
Robust invisible watermarking aims to embed hidden information into images such that the watermark can survive various image manipulations. However, the rise of powerful diffusion-based image generation and editing techniques poses a new threat to these watermarking schemes. In this paper, we present a theoretical study and method demonstrating that diffusion models can effectively break robust image watermarks that were designed to resist conventional perturbations. We show that a diffusion-driven ``image regeneration'' process can erase embedded watermarks while preserving perceptual image content. We further introduce a novel guided diffusion attack that explicitly targets the watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion-based transformation, the mutual information between the watermarked image and the embedded watermark payload vanishes, resulting in decoding failure. Experimentally, we evaluate our approach on multiple state-of-the-art watermarking schemes (including the deep learning-based methods StegaStamp, TrustMark, and VINE) and demonstrate near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings highlight a fundamental vulnerability in current robust watermarking techniques against generative model-based attacks, underscoring the need for new watermarking strategies in the era of generative AI.
comment: Preprint
☆ A Dynamic Mode Decomposition Approach to Morphological Component Analysis
This paper introduces a novel methodology of adapting the representation of videos based on the dynamics of their scene content variation. In particular, we demonstrate how the clustering of dynamic mode decomposition eigenvalues can be leveraged to learn an adaptive video representation for separating structurally distinct morphologies of a video. We extend the morphological component analysis (MCA) algorithm, which uses multiple predefined incoherent dictionaries and a sparsity prior to separate distinct sources in signals, by introducing our novel eigenspace clustering technique to obtain data-driven MCA dictionaries, which we call dynamic morphological component analysis (DMCA). After deriving our novel algorithm, we offer a motivational example of DMCA applied to a still image, then demonstrate DMCA's effectiveness in denoising applications on videos from the Adobe 240fps dataset. Afterwards, we provide an example of DMCA enhancing the signal-to-noise ratio of a faint target summed with a sea state, and conclude the paper by applying DMCA to separate a bicycle from wind clutter in inverse synthetic aperture radar images.
Diffusion Models for Low-Light Image Enhancement: A Multi-Perspective Taxonomy and Performance Analysis
Low-light image enhancement (LLIE) is vital for safety-critical applications such as surveillance, autonomous navigation, and medical imaging, where visibility degradation can impair downstream task performance. Recently, diffusion models have emerged as a promising generative paradigm for LLIE due to their capacity to model complex image distributions via iterative denoising. This survey provides an up-to-date critical analysis of diffusion models for LLIE, distinctively featuring an in-depth comparative performance evaluation against Generative Adversarial Network and Transformer-based state-of-the-art methods, a thorough examination of practical deployment challenges, and a forward-looking perspective on the role of emerging paradigms like foundation models. We propose a multi-perspective taxonomy encompassing six categories: Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal, and Autonomous; that map enhancement methods across physical priors, conditioning schemes, and computational efficiency. Our taxonomy is grounded in a hybrid view of both the model mechanism and the conditioning signals. We evaluate qualitative failure modes, benchmark inconsistencies, and trade-offs between interpretability, generalization, and inference efficiency. We also discuss real-world deployment constraints (e.g., memory, energy use) and ethical considerations. This survey aims to guide the next generation of diffusion-based LLIE research by highlighting trends and surfacing open research questions, including novel conditioning, real-time adaptation, and the potential of foundation models.
☆ Shaken or Stirred? An Analysis of MetaFormer's Token Mixing for Medical Imaging
The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable designs choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans eight datasets covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g. grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel-MLPs already provide the necessary cross-channel interactions. Our code is available on GitHub.
comment: Code and data: https://github.com/multimodallearning/MetaFormerMedImaging/tree/clean_code
Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample $x$ efficiently and in closed-form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as {\bf JEPA-SCORE}.
☆ A Warm-basis Method for Bridging Learning and Iteration: a Case Study in Fluorescence Molecular Tomography
Fluorescence Molecular Tomography (FMT) is a widely used non-invasive optical imaging technology in biomedical research. It usually faces significant accuracy challenges in depth reconstruction, and conventional iterative methods struggle with poor $z$-resolution even with advanced regularization. Supervised learning approaches can improve recovery accuracy but rely on large, high-quality paired training dataset that is often impractical to acquire in practice. This naturally raises the question of how learning-based approaches can be effectively combined with iterative schemes to yield more accurate and stable algorithms. In this work, we present a novel warm-basis iterative projection method (WB-IPM) and establish its theoretical underpinnings. The method is able to achieve significantly more accurate reconstructions than the learning-based and iterative-based methods. In addition, it allows a weaker loss function depending solely on the directional component of the difference between ground truth and neural network output, thereby substantially reducing the training effort. These features are justified by our error analysis as well as simulated and real-data experiments.
☆ Kaputt: A Large-Scale Dataset for Visual Defect Detection ICCV 2025
We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec-AD and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download under https://www.kaputt-dataset.com.
comment: Accepted to ICCV 2025
Efficient Universal Models for Medical Image Segmentation via Weakly Supervised In-Context Learning
Universal models for medical image segmentation, such as interactive and in-context learning (ICL) models, offer strong generalization but require extensive annotations. Interactive models need repeated user prompts for each image, while ICL relies on dense, pixel-level labels. To address this, we propose Weakly Supervised In-Context Learning (WS-ICL), a new ICL paradigm that leverages weak prompts (e.g., bounding boxes or points) instead of dense labels for context. This approach significantly reduces annotation effort by eliminating the need for fine-grained masks and repeated user prompting for all images. We evaluated the proposed WS-ICL model on three held-out benchmarks. Experimental results demonstrate that WS-ICL achieves performance comparable to regular ICL models at a significantly lower annotation cost. In addition, WS-ICL is highly competitive even under the interactive paradigm. These findings establish WS-ICL as a promising step toward more efficient and unified universal models for medical image segmentation. Our code and model are publicly available at https://github.com/jiesihu/Weak-ICL.
☆ $\bf{D^3}$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection ICCV2025
The emergence of visual autoregressive (AR) models has revolutionized image generation while presenting new challenges for synthetic image detection. Unlike previous GAN or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D$^3$QE) for autoregressive-generated image detection that exploits the distinctive patterns and the frequency distribution bias of the codebook existing in real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features and quantization error latent. To evaluate our method, we construct a comprehensive dataset termed ARForensics covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of D$^3$QE across different AR models, with robustness to real-world perturbations. Code is available at \href{https://github.com/Zhangyr2022/D3QE}{https://github.com/Zhangyr2022/D3QE}.
comment: 10 pages, 5 figures, published to ICCV2025
☆ BioAutoML-NAS: An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data
Insect classification is important for agricultural management and ecological research, as it directly affects crop health and production. However, this task remains challenging due to the complex characteristics of insects, class imbalance, and large-scale datasets. To address these issues, we propose BioAutoML-NAS, the first BioAutoML model using multimodal data, including images, and metadata, which applies neural architecture search (NAS) for images to automatically learn the best operations for each connection within each cell. Multiple cells are stacked to form the full network, each extracting detailed image feature representations. A multimodal fusion module combines image embeddings with metadata, allowing the model to use both visual and categorical biological information to classify insects. An alternating bi-level optimization training strategy jointly updates network weights and architecture parameters, while zero operations remove less important connections, producing sparse, efficient, and high-performing architectures. Extensive evaluation on the BIOSCAN-5M dataset demonstrates that BioAutoML-NAS achieves 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming state-of-the-art transfer learning, transformer, AutoML, and NAS methods by approximately 16%, 10%, and 8% respectively. Further validation on the Insects-1M dataset obtains 93.25% accuracy, 93.71% precision, 92.74% recall, and a 93.22% F1 score. These results demonstrate that BioAutoML-NAS provides accurate, confident insect classification that supports modern sustainable farming.
☆ acia-workflows: Automated Single-cell Imaging Analysis for Scalable and Deep Learning-based Live-cell Imaging Analysis Workflows
Live-cell imaging (LCI) technology enables the detailed spatio-temporal characterization of living cells at the single-cell level, which is critical for advancing research in the life sciences, from biomedical applications to bioprocessing. High-throughput setups with tens to hundreds of parallel cell cultivations offer the potential for robust and reproducible insights. However, these insights are obscured by the large amount of LCI data recorded per experiment. Recent advances in state-of-the-art deep learning methods for cell segmentation and tracking now enable the automated analysis of such large data volumes, offering unprecedented opportunities to systematically study single-cell dynamics. The next key challenge lies in integrating these powerful tools into accessible, flexible, and user-friendly workflows that support routine application in biological research. In this work, we present acia-workflows, a platform that combines three key components: (1) the Automated live-Cell Imaging Analysis (acia) Python library, which supports the modular design of image analysis pipelines offering eight deep learning segmentation and tracking approaches; (2) workflows that assemble the image analysis pipeline, its software dependencies, documentation, and visualizations into a single Jupyter Notebook, leading to accessible, reproducible and scalable analysis workflows; and (3) a collection of application workflows showcasing the analysis and customization capabilities in real-world applications. Specifically, we present three workflows to investigate various types of microfluidic LCI experiments ranging from growth rate comparisons to precise, minute-resolution quantitative analyses of individual dynamic cells responses to changing oxygen conditions. Our collection of more than ten application workflows is open source and publicly available at https://github.com/JuBiotech/acia-workflows.
☆ The Safety Challenge of World Models for Embodied AI Agents: A Review
The rapid progress in embodied artificial intelligence has highlighted the necessity for more advanced and integrated models that can perceive, interpret, and predict environmental dynamics. In this context, World Models (WMs) have been introduced to provide embodied agents with the abilities to anticipate future environmental states and fill in knowledge gaps, thereby enhancing agents' ability to plan and execute actions. However, when dealing with embodied agents it is fundamental to ensure that predictions are safe for both the agent and the environment. In this article, we conduct a comprehensive literature review of World Models in the domains of autonomous driving and robotics, with a specific focus on the safety implications of scene and control generation tasks. Our review is complemented by an empirical analysis, wherein we collect and examine predictions from state-of-the-art models, identify and categorize common faults (herein referred to as pathologies), and provide a quantitative evaluation of the results.
☆ Towards Robust and Realible Multimodal Fake News Detection with Incomplete Modality
Multimodal fake news detection (MFND) has become an urgent task with the emergence of huge multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel generic and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning to compensate for missing modalities by dynamically leveraging complementary information through multiple experts. (2) Incomplete Modality Adapters compensates for the missing information by leveraging the new feature distribution. (3) Modality Missing Learning leveraging an label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of fake news detection in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.
☆ Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow ICCV' 2025
Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the "key" is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneeringly incorporates motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules: Temporal Granularity Optimization (TGO) adaptively refines framelevel hierarchies, which first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
comment: Accepted to ICCV' 2025
☆ FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
comment: Acepted at IJCNN 2025
☆ StereoSync: Spatially-Aware Stereo Audio Generation from Video
Although audio generation has been widely studied over recent years, video-aligned audio generation still remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. Moreover, StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Indeed, given an input video, our approach extracts spatial cues from depth maps and bounding boxes, using them as cross-attention conditioning in a diffusion-based audio generation model. Such an approach allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.
comment: Accepted at IJCNN 2025
☆ Leveraging Vision Transformers for Enhanced Classification of Emotions using ECG Signals
Biomedical signals provide insights into various conditions affecting the human body. Beyond diagnostic capabilities, these signals offer a deeper understanding of how specific organs respond to an individual's emotions and feelings. For instance, ECG data can reveal changes in heart rate variability linked to emotional arousal, stress levels, and autonomic nervous system activity. This data offers a window into the physiological basis of our emotional states. Recent advancements in the field diverge from conventional approaches by leveraging the power of advanced transformer architectures, which surpass traditional machine learning and deep learning methods. We begin by assessing the effectiveness of the Vision Transformer (ViT), a forefront model in image classification, for identifying emotions in imaged ECGs. Following this, we present and evaluate an improved version of ViT, integrating both CNN and SE blocks, aiming to bolster performance on imaged ECGs associated with emotion detection. Our method unfolds in two critical phases: first, we apply advanced preprocessing techniques for signal purification and converting signals into interpretable images using continuous wavelet transform and power spectral density analysis; second, we unveil a performance-boosted vision transformer architecture, cleverly enhanced with convolutional neural network components, to adeptly tackle the challenges of emotion recognition. Our methodology's robustness and innovation were thoroughly tested using ECG data from the YAAD and DREAMER datasets, leading to remarkable outcomes. For the YAAD dataset, our approach outperformed existing state-of-the-art methods in classifying seven unique emotional states, as well as in valence and arousal classification. Similarly, in the DREAMER dataset, our method excelled in distinguishing between valence, arousal and dominance, surpassing current leading techniques.
comment: 14pages, 2 figures
☆ Deformable Image Registration for Self-supervised Cardiac Phase Detection in Multi-View Multi-Disease Cardiac Magnetic Resonance Images
Cardiovascular magnetic resonance (CMR) is the gold standard for assessing cardiac function, but individual cardiac cycles complicate automatic temporal comparison or sub-phase analysis. Accurate cardiac keyframe detection can eliminate this problem. However, automatic methods solely derive end-systole (ES) and end-diastole (ED) frames from left ventricular volume curves, which do not provide a deeper insight into myocardial motion. We propose a self-supervised deep learning method detecting five keyframes in short-axis (SAX) and four-chamber long-axis (4CH) cine CMR. Initially, dense deformable registration fields are derived from the images and used to compute a 1D motion descriptor, which provides valuable insights into global cardiac contraction and relaxation patterns. From these characteristic curves, keyframes are determined using a simple set of rules. The method was independently evaluated for both views using three public, multicentre, multidisease datasets. M&Ms-2 (n=360) dataset was used for training and evaluation, and M&Ms (n=345) and ACDC (n=100) datasets for repeatability control. Furthermore, generalisability to patients with rare congenital heart defects was tested using the German Competence Network (GCN) dataset. Our self-supervised approach achieved improved detection accuracy by 30% - 51% for SAX and 11% - 47% for 4CH in ED and ES, as measured by cyclic frame difference (cFD), compared with the volume-based approach. We can detect ED and ES, as well as three additional keyframes throughout the cardiac cycle with a mean cFD below 1.31 frames for SAX and 1.73 for LAX. Our approach enables temporally aligned inter- and intra-patient analysis of cardiac dynamics, irrespective of cycle or phase lengths. GitHub repository: https://github.com/Cardio-AI/cmr-multi-view-phase-detection.git
comment: Main 30 pages, 6 figures
☆ Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
The Steered Mixture of Experts regression framework has demonstrated strong performance in image reconstruction, compression, denoising, and super-resolution. However, its high computational cost limits practical applications. This work introduces a rasterization-based optimization strategy that combines the efficiency of rasterized Gaussian kernel rendering with the edge-aware gating mechanism of the Steered Mixture of Experts. The proposed method is designed to accelerate two-dimensional image regression while maintaining the model's inherent sparsity and reconstruction quality. By replacing global iterative optimization with a rasterized formulation, the method achieves significantly faster parameter updates and more memory-efficient model representations. In addition, the proposed framework supports applications such as native super-resolution and image denoising, which are not directly achievable with standard rasterized Gaussian kernel approaches. The combination of fast rasterized optimization with the edge-aware structure of the Steered Mixture of Experts provides a new balance between computational efficiency and reconstruction fidelity for two-dimensional image processing tasks.
☆ Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates AISTATS 2026
Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic B\'ezier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify B\'ezier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
comment: 20 pages, 4 figures, Submitted to AISTATS 2026
☆ Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection
Out-of-distribution (OOD) detection is essential for reliably deploying machine learning models in the wild. Yet, most methods treat large pre-trained models as monolithic encoders and rely solely on their final-layer representations for detection. We challenge this wisdom. We reveal the \textit{intermediate layers} of pre-trained models, shaped by residual connections that subtly transform input projections, \textit{can} encode \textit{surprisingly rich and diverse signals} for detecting distributional shifts. Importantly, to exploit latent representation diversity across layers, we introduce an entropy-based criterion to \textit{automatically} identify layers offering the most complementary information in a training-free setting -- \textit{without access to OOD data}. We show that selectively incorporating these intermediate representations can increase the accuracy of OOD detection by up to \textbf{$10\%$} in far-OOD and over \textbf{$7\%$} in near-OOD benchmarks compared to state-of-the-art training-free methods across various model architectures and training objectives. Our findings reveal a new avenue for OOD detection research and uncover the impact of various training objectives and model architectures on confidence-based OOD detection methods.
comment: 28
☆ A Novel Technique for Robust Training of Deep Networks With Multisource Weak Labeled Remote Sensing Data
Deep learning has gained broad interest in remote sensing image scene classification thanks to the effectiveness of deep neural networks in extracting the semantics from complex data. However, deep networks require large amounts of training samples to obtain good generalization capabilities and are sensitive to errors in the training labels. This is a problem in remote sensing since highly reliable labels can be obtained at high costs and in limited amount. However, many sources of less reliable labeled data are available, e.g., obsolete digital maps. In order to train deep networks with larger datasets, we propose both the combination of single or multiple weak sources of labeled data with a small but reliable dataset to generate multisource labeled datasets and a novel training strategy where the reliability of each source is taken in consideration. This is done by exploiting the transition matrices describing the statistics of the errors of each source. The transition matrices are embedded into the labels and used during the training process to weigh each label according to the related source. The proposed method acts as a weighting scheme at gradient level, where each instance contributes with different weights to the optimization of different classes. The effectiveness of the proposed method is validated by experiments on different datasets. The results proved the robustness and capability of leveraging on unreliable source of labels of the proposed method.
comment: 16 pages, 9 figures, accepted article
☆ OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search
Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. This multi-view representation discrepancy of the same object in the query and the optimization objective collide across these stages, making it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.
☆ ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving
The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state-of-the-art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes by a margin of 2.53% in mAP (50.95% vs. 48.42%).
☆ Redefining Generalization in Visual Domains: A Two-Axis Framework for Fake Image Detection with FusionDetect
The rapid development of generative models has made it increasingly crucial to develop detectors that can reliably detect synthetic images. Although most of the work has now focused on cross-generator generalization, we argue that this viewpoint is too limited. Detecting synthetic images involves another equally important challenge: generalization across visual domains. To bridge this gap,we present the OmniGen Benchmark. This comprehensive evaluation dataset incorporates 12 state-of-the-art generators, providing a more realistic way of evaluating detector performance under realistic conditions. In addition, we introduce a new method, FusionDetect, aimed at addressing both vectors of generalization. FusionDetect draws on the benefits of two frozen foundation models: CLIP & Dinov2. By deriving features from both complementary models,we develop a cohesive feature space that naturally adapts to changes in both thecontent and design of the generator. Our extensive experiments demonstrate that FusionDetect delivers not only a new state-of-the-art, which is 3.87% more accurate than its closest competitor and 6.13% more precise on average on established benchmarks, but also achieves a 4.48% increase in accuracy on OmniGen,along with exceptional robustness to common image perturbations. We introduce not only a top-performing detector, but also a new benchmark and framework for furthering universal AI image detection. The code and dataset are available at http://github.com/amir-aman/FusionDetect
comment: Project code: http://github.com/amir-aman/FusionDetect
☆ Data Factory with Minimal Human Effort Using VLMs
Generating enough and diverse data through augmentation offers an efficient solution to the time-consuming and labour-intensive process of collecting and annotating pixel-wise images. Traditional data augmentation techniques often face challenges in manipulating high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative, by effectively utilizing text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotations and significantly improves downstream tasks. To improve the fidelity and diversity, we add a Multi-way Prompt Generator, Mask Generator and High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i present promising performance and outperform concurrent work for one-shot semantic segmentation.
comment: Tech report
☆ Neighborhood-Adaptive Generalized Linear Graph Embedding with Latent Pattern Mining
Graph embedding has been widely applied in areas such as network analysis, social network mining, recommendation systems, and bioinformatics. However, current graph construction methods often require the prior definition of neighborhood size, limiting the effective revelation of potential structural correlations in the data. Additionally, graph embedding methods using linear projection heavily rely on a singular pattern mining approach, resulting in relative weaknesses in adapting to different scenarios. To address these challenges, we propose a novel model, Neighborhood-Adaptive Generalized Linear Graph Embedding (NGLGE), grounded in latent pattern mining. This model introduces an adaptive graph learning method tailored to the neighborhood, effectively revealing intrinsic data correlations. Simultaneously, leveraging a reconstructed low-rank representation and imposing $\ell_{2,0}$ norm constraint on the projection matrix allows for flexible exploration of additional pattern information. Besides, an efficient iterative solving algorithm is derived for the proposed model. Comparative evaluations on datasets from diverse scenarios demonstrate the superior performance of our model compared to state-of-the-art methods.
☆ AgeBooth: Controllable Facial Aging and Rejuvenation via Diffusion Models
Recent diffusion model research focuses on generating identity-consistent images from a reference photo, but they struggle to accurately control age while preserving identity, and fine-tuning such models often requires costly paired images across ages. In this paper, we propose AgeBooth, a novel age-specific finetuning approach that can effectively enhance the age control capability of adapterbased identity personalization models without the need for expensive age-varied datasets. To reduce dependence on a large amount of age-labeled data, we exploit the linear nature of aging by introducing age-conditioned prompt blending and an age-specific LoRA fusion strategy that leverages SVDMix, a matrix fusion technique. These techniques enable high-quality generation of intermediate-age portraits. Our AgeBooth produces realistic and identity-consistent face images across different ages from a single reference image. Experiments show that AgeBooth achieves superior age control and visual quality compared to previous state-of-the-art editing-based methods.
☆ D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, datasets of human-collected and pseudo-labeled, and VAPT-trained models available at https://worv-ai.github.io/d2e/
☆ Context Matters: Learning Global Semantics for Visual Reasoning and Comprehension
Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper, we argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes, and such a gap can be narrowed through the design of a semantic-grounded objective. Specifically, we notice that individual words in natural language are inherently semantic, and modeling directly on word tokens naturally learns a realistic distribution. In contrast, ViTs rely on spatial patchification, which inevitably lacks semantic information. To bridge this gap, we propose to directly model "object" as the visual equivalence of "word," pushing the model to learn the global context and semantics among visual elements. We investigate our hypotheses via masked image modeling (MIM), a framework where our approach can be readily tested by applying masks to visual objects rather than random patches. Considerable evidence from qualitative and quantitative evaluations reveals a key finding: object-level representation alone helps to learn a real-world distribution, whereas pixel-averaging shortcuts are often learned without it. Moreover, further evaluations with multimodal LLMs (MLLM) on visual question answering (VQA, GQA, ScienceQA) tasks demonstrate the strong reasoning and contextual understanding gained with this simple objective. We hope our study highlights the effectiveness of object-level encoding and provides a plausible direction for developing stronger vision encoders and tokenizers. Code and model will be publicly released. Keywords: Semantic Visual Tokenizer, Vision Reasoning, In-context Learning, Multimodal Reasoning
☆ Development and Validation of a Low-Cost Imaging System for Seedling Germination Kinetics through Time-Cumulative Analysis
The study investigates the effects of R. solani inoculation on the germination and early development of Lactuca sativa L. seeds using a low-cost, image-based monitoring system. Multiple cameras were deployed to continuously capture images of the germination process in both infected and control groups. The objective was to assess the impact of the pathogen by analyzing germination dynamics and growth over time. To achieve this, a novel image analysis pipeline was developed. The algorithm integrates both morphological and spatial features to identify and quantify individual seedlings, even under complex conditions where traditional image analyses fails. A key innovation of the method lies in its temporal integration: each analysis step considers not only the current status but also their developmental across prior time points. This approach enables robust discrimination of individual seedlings, especially when overlapping leaves significantly hinder object separation. The method demonstrated high accuracy in seedling counting and vigor assessment, even in challenging scenarios characterized by dense and intertwined growth. Results confirm that R. solani infection significantly reduces germination rates and early seedling vigor. The study also validates the feasibility of combining low-cost imaging hardware with advanced computational tools to obtain phenotyping data in a non-destructive and scalable manner. The temporal integration enabled accurate quantification of germinated seeds and precise determination of seedling emergence timing. This approach proved particularly effective in later stages of the experiment, where conventional segmentation techniques failed due to overlapping or intertwined seedlings, making accurate counting. The method achieved a coefficient of determination of 0.98 and a root mean square error (RMSE) of 1.12, demonstrating its robustness and reliability.
☆ DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation
Despite the prevalence of transparent object interactions in human everyday life, transparent robotic manipulation research remains limited to short-horizon tasks and basic grasping capabilities.Although some methods have partially addressed these issues, most of them have limitations in generalizability to novel objects and are insufficient for precise long-horizon robot manipulation. To address this limitation, we propose DeLTa (Demonstration and Language-Guided Novel Transparent Object Manipulation), a novel framework that integrates depth estimation, 6D pose estimation, and vision-language planning for precise long-horizon manipulation of transparent objects guided by natural task instructions. A key advantage of our method is its single-demonstration approach, which generalizes 6D trajectories to novel transparent objects without requiring category-level priors or additional training. Additionally, we present a task planner that refines the VLM-generated plan to account for the constraints of a single-arm, eye-in-hand robot for long-horizon object manipulation tasks. Through comprehensive evaluation, we demonstrate that our method significantly outperforms existing transparent object manipulation approaches, particularly in long-horizon scenarios requiring precise manipulation capabilities. Project page: https://sites.google.com/view/DeLTa25/
comment: Project page: https://sites.google.com/view/DeLTa25/
☆ When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach
Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms from the audio signals, plus an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve the literature by updating from old backbones, e.g. ResNet, with a CLIP-based encoder and constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperformed previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.
☆ Teleportraits: Training-Free People Insertion into Any Scene
The task of realistically inserting a human from a reference image into a background scene is highly challenging, requiring the model to (1) determine the correct location and poses of the person and (2) perform high-quality personalization conditioned on the background. Previous approaches often treat them as separate problems, overlooking their interconnections, and typically rely on training to achieve high performance. In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. By combining inversion techniques with classifier-free guidance, our method achieves affordance-aware global editing, seamlessly inserting people into scenes. Furthermore, our proposed mask-guided self-attention mechanism ensures high-quality personalization, preserving the subject's identity, clothing, and body features from just a single reference image. To the best of our knowledge, we are the first to perform realistic human insertions into scenes in a training-free manner and achieve state-of-the-art results in diverse composite scene images with excellent identity preservation in backgrounds and subjects.
☆ A Hierarchical Geometry-guided Transformer for Histological Subtyping of Primary Liver Cancer
Primary liver malignancies are widely recognized as the most heterogeneous and prognostically diverse cancers of the digestive system. Among these, hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) emerge as the two principal histological subtypes, demonstrating significantly greater complexity in tissue morphology and cellular architecture than other common tumors. The intricate representation of features in Whole Slide Images (WSIs) encompasses abundant crucial information for liver cancer histological subtyping, regarding hierarchical pyramid structure, tumor microenvironment (TME), and geometric representation. However, recent approaches have not adequately exploited these indispensable effective descriptors, resulting in a limited understanding of histological representation and suboptimal subtyping performance. To mitigate these limitations, ARGUS is proposed to advance histological subtyping in liver cancer by capturing the macro-meso-micro hierarchical information within the TME. Specifically, we first construct a micro-geometry feature to represent fine-grained cell-level pattern via a geometric structure across nuclei, thereby providing a more refined and precise perspective for delineating pathological images. Then, a Hierarchical Field-of-Views (FoVs) Alignment module is designed to model macro- and meso-level hierarchical interactions inherent in WSIs. Finally, the augmented micro-geometry and FoVs features are fused into a joint representation via present Geometry Prior Guided Fusion strategy for modeling holistic phenotype interactions. Extensive experiments on public and private cohorts demonstrate that our ARGUS achieves state-of-the-art (SOTA) performance in histological subtyping of liver cancer, which provide an effective diagnostic tool for primary liver malignancies in clinical practice.
comment: 7 pages, 2 figures, accepted by IEEE BIBM 2025
☆ SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets
In this work, we extend a recent method for script-driven video summarization, originally considering just the visual content of the video, to take into account the relevance of the user-provided script also with the video's spoken content. In the proposed method, SD-MVSum, the dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for video summarization (S-VideoXum, MrHiSum), to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of our SD-MVSum method against other SOTA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.
comment: Under review
☆ EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario
Reproducing cognitive development, group interaction, and long-term evolution in virtual classrooms remains a core challenge for educational AI, as real classrooms integrate open-ended cognition, dynamic social interaction, affective factors, and multi-session development rarely captured together. Existing approaches mostly focus on short-term or single-agent settings, limiting systematic study of classroom complexity and cross-task reuse. We present EduVerse, the first user-defined multi-agent simulation space that supports environment, agent, and session customization. A distinctive human-in-the-loop interface further allows real users to join the space. Built on a layered CIE (Cognition-Interaction-Evolution) architecture, EduVerse ensures individual consistency, authentic interaction, and longitudinal adaptation in cognition, emotion, and behavior-reproducing realistic classroom dynamics with seamless human-agent integration. We validate EduVerse in middle-school Chinese classes across three text genres, environments, and multiple sessions. Results show: (1) Instructional alignment: simulated IRF rates (0.28-0.64) closely match real classrooms (0.37-0.49), indicating pedagogical realism; (2) Group interaction and role differentiation: network density (0.27-0.40) with about one-third of peer links realized, while human-agent tasks indicate a balance between individual variability and instructional stability; (3) Cross-session evolution: the positive transition rate R+ increase by 11.7% on average, capturing longitudinal shifts in behavior, emotion, and cognition and revealing structured learning trajectories. Overall, EduVerse balances realism, reproducibility, and interpretability, providing a scalable platform for educational AI. The system will be open-sourced to foster cross-disciplinary research.
comment: Preprint, Under review
☆ Ocular-Induced Abnormal Head Posture: Diagnosis and Missing Data Imputation
Ocular-induced abnormal head posture (AHP) is a compensatory mechanism that arises from ocular misalignment conditions, such as strabismus, enabling patients to reduce diplopia and preserve binocular vision. Early diagnosis minimizes morbidity and secondary complications such as facial asymmetry; however, current clinical assessments remain largely subjective and are further complicated by incomplete medical records. This study addresses both challenges through two complementary deep learning frameworks. First, AHP-CADNet is a multi-level attention fusion framework for automated diagnosis that integrates ocular landmarks, head pose features, and structured clinical attributes to generate interpretable predictions. Second, a curriculum learning-based imputation framework is designed to mitigate missing data by progressively leveraging structured variables and unstructured clinical notes to enhance diagnostic robustness under realistic data conditions. Evaluation on the PoseGaze-AHP dataset demonstrates robust diagnostic performance. AHP-CADNet achieves 96.9-99.0 percent accuracy across classification tasks and low prediction errors for continuous variables, with MAE ranging from 0.103 to 0.199 and R2 exceeding 0.93. The imputation framework maintains high accuracy across all clinical variables (93.46-99.78 percent with PubMedBERT), with clinical dependency modeling yielding significant improvements (p < 0.001). These findings confirm the effectiveness of both frameworks for automated diagnosis and recovery from missing data in clinical settings.
☆ Combined Hyperbolic and Euclidean Soft Triple Loss Beyond the Single Space Deep Metric Learning
Deep metric learning (DML) aims to learn a neural network mapping data to an embedding space, which can represent semantic similarity between data points. Hyperbolic space is attractive for DML since it can represent richer structures, such as tree structures. DML in hyperbolic space is based on pair-based loss or unsupervised regularization loss. On the other hand, supervised proxy-based losses in hyperbolic space have not been reported yet due to some issues in applying proxy-based losses in a hyperbolic space. However, proxy-based losses are attractive for large-scale datasets since they have less training complexity. To address these, this paper proposes the Combined Hyperbolic and Euclidean Soft Triple (CHEST) loss. CHEST loss is composed of the proxy-based losses in hyperbolic and Euclidean spaces and the regularization loss based on hyperbolic hierarchical clustering. We find that the combination of hyperbolic and Euclidean spaces improves DML accuracy and learning stability for both spaces. Finally, we evaluate the CHEST loss on four benchmark datasets, achieving a new state-of-the-art performance.
comment: 12 pages, 4 figures
☆ From Neural Activity to Computation: Biological Reservoirs for Pattern Recognition in Digit Classification ICCV2025
In this paper, we present a biologically grounded approach to reservoir computing (RC), in which a network of cultured biological neurons serves as the reservoir substrate. This system, referred to as biological reservoir computing (BRC), replaces artificial recurrent units with the spontaneous and evoked activity of living neurons. A multi-electrode array (MEA) enables simultaneous stimulation and readout across multiple sites: inputs are delivered through a subset of electrodes, while the remaining ones capture the resulting neural responses, mapping input patterns into a high-dimensional biological feature space. We evaluate the system through a case study on digit classification using a custom dataset. Input images are encoded and delivered to the biological reservoir via electrical stimulation, and the corresponding neural activity is used to train a simple linear classifier. To contextualize the performance of the biological system, we also include a comparison with a standard artificial reservoir trained on the same task. The results indicate that the biological reservoir can effectively support classification, highlighting its potential as a viable and interpretable computational substrate. We believe this work contributes to the broader effort of integrating biological principles into machine learning and aligns with the goals of human-inspired vision by exploring how living neural systems can inform the design of efficient and biologically plausible models.
comment: Accepted at HiCV@ICCV2025
☆ NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO -- a hyperparameter-free fully TTA method, that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
☆ Beyond Spectral Peaks: Interpreting the Cues Behind Synthetic Image Detection
Over the years, the forensics community has proposed several deep learning-based detectors to mitigate the risks of generative AI. Recently, frequency-domain artifacts (particularly periodic peaks in the magnitude spectrum), have received significant attention, as they have been often considered a strong indicator of synthetic image generation. However, state-of-the-art detectors are typically used as black-boxes, and it still remains unclear whether they truly rely on these peaks. This limits their interpretability and trust. In this work, we conduct a systematic study to address this question. We propose a strategy to remove spectral peaks from images and analyze the impact of this operation on several detectors. In addition, we introduce a simple linear detector that relies exclusively on frequency peaks, providing a fully interpretable baseline free from the confounding influence of deep learning. Our findings reveal that most detectors are not fundamentally dependent on spectral peaks, challenging a widespread assumption in the field and paving the way for more transparent and reliable forensic tools.
☆ InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment
Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
☆ TFM Dataset: A Novel Multi-task Dataset and Integrated Pipeline for Automated Tear Film Break-Up Segmentation
Tear film break-up (TFBU) analysis is critical for diagnosing dry eye syndrome, but automated TFBU segmentation remains challenging due to the lack of annotated datasets and integrated solutions. This paper introduces the Tear Film Multi-task (TFM) Dataset, the first comprehensive dataset for multi-task tear film analysis, comprising 15 high-resolution videos (totaling 6,247 frames) annotated with three vision tasks: frame-level classification ('clear', 'closed', 'broken', 'blur'), Placido Ring detection, and pixel-wise TFBU area segmentation. Leveraging this dataset, we first propose TF-Net, a novel and efficient baseline segmentation model. TF-Net incorporates a MobileOne-mini backbone with re-parameterization techniques and an enhanced feature pyramid network to achieve a favorable balance between accuracy and computational efficiency for real-time clinical applications. We further establish benchmark performance on the TFM segmentation subset by comparing TF-Net against several state-of-the-art medical image segmentation models. Furthermore, we design TF-Collab, a novel integrated real-time pipeline that synergistically leverages models trained on all three tasks of the TFM dataset. By sequentially orchestrating frame classification for BUT determination, pupil region localization for input standardization, and TFBU segmentation, TF-Collab fully automates the analysis. Experimental results demonstrate the effectiveness of the proposed TF-Net and TF-Collab, providing a foundation for future research in ocular surface diagnostics. Our code and the TFM datasets are available at https://github.com/glory-wan/TF-Net
☆ PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias emphasizes short-range continuity but undermines the model's capacity to capture long-range dependencies, hindering its ability to enforce global structural properties such as symmetry, consistent topology, and large-scale geometric regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. In addition, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, in dense generation with 8,192 points, PointNSP's advantages become even more pronounced, underscoring its scalability potential.
Efficient Conditional Generation on Scale-based Visual Autoregressive Models
Recent advances in autoregressive (AR) models have demonstrated their potential to rival diffusion models in image synthesis. However, for complex spatially-conditioned generation, current AR approaches rely on fine-tuning the pre-trained model, leading to significant training costs. In this paper, we propose the Efficient Control Model (ECM), a plug-and-play framework featuring a lightweight control module that introduces control signals via a distributed architecture. This architecture consists of context-aware attention layers that refine conditional features using real-time generated tokens, and a shared gated feed-forward network (FFN) designed to maximize the utilization of its limited capacity and ensure coherent control feature learning. Furthermore, recognizing the critical role of early-stage generation in determining semantic structure, we introduce an early-centric sampling strategy that prioritizes learning early control sequences. This approach reduces computational cost by lowering the number of training tokens per iteration, while a complementary temperature scheduling during inference compensates for the resulting insufficient training of late-stage tokens. Extensive experiments on scale-based AR models validate that our method achieves high-fidelity and diverse control over image generation, surpassing existing baselines while significantly improving both training and inference efficiency.
☆ HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection
Recent Human-object interaction detection (HOID) methods highly require prior knowledge from VLMs to enhance the interaction recognition capabilities. The training strategies and model architectures for connecting the knowledge from VLMs to the HOI instance representations from the object detector are challenging, and the whole framework is complex for further development or application. On the other hand, the inherent reasoning abilities of MLLMs on human-object interaction detection are under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and first explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task by pure text. The results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with great generalization ability. The source code is available at https://github.com/cjw2021/HOI-R1.
☆ Improving Chain-of-Thought Efficiency for Autoregressive Image Generation
Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy -- a phenomenon we call visual overthinking -- which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.
☆ CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval
Existing Visual Language Models (VLMs) suffer structural limitations where a few low contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce \textbf{CalibCLIP}, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations.In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP
comment: ACMMM2025(oral)
☆ HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video
Digitizing the physical world into accurate simulation-ready virtual environments offers significant opportunities in a variety of fields such as augmented and virtual reality, gaming, and robotics. However, current 3D reconstruction and scene-understanding methods commonly fall short in one or more critical aspects, such as geometry completeness, object interactivity, physical plausibility, photorealistic rendering, or realistic physical properties for reliable dynamic simulation. To address these limitations, we introduce HoloScene, a novel interactive 3D reconstruction framework that simultaneously achieves these requirements. HoloScene leverages a comprehensive interactive scene-graph representation, encoding object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships. Reconstruction is formulated as an energy-based optimization problem, integrating observational data, physical constraints, and generative priors into a unified, coherent objective. Optimization is efficiently performed via a hybrid approach combining sampling-based exploration with gradient-based refinement. The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints. Evaluations conducted on multiple benchmark datasets demonstrate superior performance, while practical use-cases in interactive gaming and real-time digital-twin manipulation illustrate HoloScene's broad applicability and effectiveness. Project page: https://xiahongchi.github.io/HoloScene.
comment: Project page: https://xiahongchi.github.io/HoloScene
☆ Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
comment: Project page: https://agenticlearning.ai/midway-network/
☆ nnSAM2: nnUNet-Enhanced One-Prompt SAM2 for Few-shot Multi-Modality Segmentation and Composition Analysis of Lumbar Paraspinal Muscles
Purpose: To develop and validate No-New SAM2 (nnsam2) for few-shot segmentation of lumbar paraspinal muscles using only a single annotated slice per dataset, and to assess its statistical comparability with expert measurements across multi-sequence MRI and multi-protocol CT. Methods: We retrospectively analyzed 1,219 scans (19,439 slices) from 762 participants across six datasets. Six slices (one per dataset) served as labeled examples, while the remaining 19,433 slices were used for testing. In this minimal-supervision setting, nnsam2 used single-slice SAM2 prompts to generate pseudo-labels, which were pooled across datasets and refined through three sequential, independent nnU-Net models. Segmentation performance was evaluated using the Dice similarity coefficient (DSC), and automated measurements-including muscle volume, fat ratio, and CT attenuation-were assessed with two one-sided tests (TOST) and intraclass correlation coefficients (ICC). Results: nnsam2 outperformed vanilla SAM2, its medical variants, TotalSegmentator, and the leading few-shot method, achieving DSCs of 0.94-0.96 on MR images and 0.92-0.93 on CT. Automated and expert measurements were statistically equivalent for muscle volume (MRI/CT), CT attenuation, and Dixon fat ratio (TOST, P < 0.05), with consistently high ICCs (0.86-1.00). Conclusion: We developed nnsam2, a state-of-the-art few-shot framework for multi-modality LPM segmentation, producing muscle volume (MRI/CT), attenuation (CT), and fat ratio (Dixon MRI) measurements that were statistically comparable to expert references. Validated across multimodal, multicenter, and multinational cohorts, and released with open code and data, nnsam2 demonstrated high annotation efficiency, robust generalizability, and reproducibility.
☆ Seeing the Big Picture: Evaluating Multimodal LLMs' Ability to Interpret and Grade Handwritten Student Work
Recent advances in multimodal large language models (MLLMs) raise the question of their potential for grading, analyzing, and offering feedback on handwritten student classwork. This capability would be particularly beneficial in elementary and middle-school mathematics education, where most work remains handwritten, because seeing students' full working of a problem provides valuable insights into their learning processes, but is extremely time-consuming to grade. We present two experiments investigating MLLM performance on handwritten student mathematics classwork. Experiment A examines 288 handwritten responses from Ghanaian middle school students solving arithmetic problems with objective answers. In this context, models achieved near-human accuracy (95%, k = 0.90) but exhibited occasional errors that human educators would be unlikely to make. Experiment B evaluates 150 mathematical illustrations from American elementary students, where the drawings are the answer to the question. These tasks lack single objective answers and require sophisticated visual interpretation as well as pedagogical judgment in order to analyze and evaluate them. We attempted to separate MLLMs' visual capabilities from their pedagogical abilities by first asking them to grade the student illustrations directly, and then by augmenting the image with a detailed human description of the illustration. We found that when the models had to analyze the student illustrations directly, they struggled, achieving only k = 0.20 with ground truth scores, but when given human descriptions, their agreement levels improved dramatically to k = 0.47, which was in line with human-to-human agreement levels. This gap suggests MLLMs can "see" and interpret arithmetic work relatively well, but still struggle to "see" student mathematical illustrations.
☆ Teamwork: Collaborative Diffusion with Low-rank Coordination and Adaptation
Large pretrained diffusion models can provide strong priors beneficial for many graphics applications. However, generative applications such as neural rendering and inverse methods such as SVBRDF estimation and intrinsic image decomposition require additional input or output channels. Current solutions for channel expansion are often application specific and these solutions can be difficult to adapt to different diffusion models or new tasks. This paper introduces Teamwork: a flexible and efficient unified solution for jointly increasing the number of input and output channels as well as adapting a pretrained diffusion model to new tasks. Teamwork achieves channel expansion without altering the pretrained diffusion model architecture by coordinating and adapting multiple instances of the base diffusion model (\ie, teammates). We employ a novel variation of Low Rank-Adaptation (LoRA) to jointly address both adaptation and coordination between the different teammates. Furthermore Teamwork supports dynamic (de)activation of teammates. We demonstrate the flexibility and efficiency of Teamwork on a variety of generative and inverse graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis.
☆ Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models
Diffusion models are powerful deep generative models (DGMs) that generate high-fidelity, diverse content. However, unlike classical DGMs, they lack an explicit, tractable low-dimensional latent space that parameterizes the data manifold. This absence limits manifold-aware analysis and operations, such as interpolation and editing. Existing interpolation methods for diffusion models typically follow paths through high-density regions, which are not necessarily aligned with the data manifold and can yield perceptually unnatural transitions. To exploit the data manifold learned by diffusion models, we propose a novel Riemannian metric on the noise space, inspired by recent findings that the Jacobian of the score function captures the tangent spaces to the local data manifold. This metric encourages geodesics in the noise space to stay within or run parallel to the learned data manifold. Experiments on image interpolation show that our metric produces perceptually more natural and faithful transitions than existing density-based and naive baselines.
☆ Human Action Recognition from Point Clouds over Time
Recent research into human action recognition (HAR) has focused predominantly on skeletal action recognition and video-based methods. With the increasing availability of consumer-grade depth sensors and Lidar instruments, there is a growing opportunity to leverage dense 3D data for action recognition, to develop a third way. This paper presents a novel approach for recognizing actions from 3D videos by introducing a pipeline that segments human point clouds from the background of a scene, tracks individuals over time, and performs body part segmentation. The method supports point clouds from both depth sensors and monocular depth estimation. At the core of the proposed HAR framework is a novel backbone for 3D action recognition, which combines point-based techniques with sparse convolutional networks applied to voxel-mapped point cloud sequences. Experiments incorporate auxiliary point features including surface normals, color, infrared intensity, and body part parsing labels, to enhance recognition accuracy. Evaluation on the NTU RGB- D 120 dataset demonstrates that the method is competitive with existing skeletal action recognition algorithms. Moreover, combining both sensor-based and estimated depth inputs in an ensemble setup, this approach achieves 89.3% accuracy when different human subjects are considered for training and testing, outperforming previous point cloud action recognition methods.
☆ ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars
3D Gaussian Splatting (3DGS) has enabled photorealistic and real-time rendering of 3D head avatars. Existing 3DGS-based avatars typically rely on tens of thousands of 3D Gaussian points (Gaussians), with the number of Gaussians fixed after training. However, many practical applications require adjustable levels of detail (LOD) to balance rendering efficiency and visual quality. In this work, we propose "ArchitectHead", the first framework for creating 3D Gaussian head avatars that support continuous control over LOD. Our key idea is to parameterize the Gaussians in a 2D UV feature space and propose a UV feature field composed of multi-level learnable feature maps to encode their latent features. A lightweight neural network-based decoder then transforms these latent features into 3D Gaussian attributes for rendering. ArchitectHead controls the number of Gaussians by dynamically resampling feature maps from the UV feature field at the desired resolutions. This method enables efficient and continuous control of LOD without retraining. Experimental results show that ArchitectHead achieves state-of-the-art (SOTA) quality in self and cross-identity reenactment tasks at the highest LOD, while maintaining near SOTA performance at lower LODs. At the lowest LOD, our method uses only 6.2\% of the Gaussians while the quality degrades moderately (L1 Loss +7.9\%, PSNR --0.97\%, SSIM --0.6\%, LPIPS Loss +24.1\%), and the rendering speed nearly doubles.
☆ Real-Time Glass Detection and Reprojection using Sensor Fusion Onboard Aerial Robots ICRA 2026
Autonomous aerial robots are increasingly being deployed in real-world scenarios, where transparent obstacles present significant challenges to reliable navigation and mapping. These materials pose a unique problem for traditional perception systems because they lack discernible features and can cause conventional depth sensors to fail, leading to inaccurate maps and potential collisions. To ensure safe navigation, robots must be able to accurately detect and map these transparent obstacles. Existing methods often rely on large, expensive sensors or algorithms that impose high computational burdens, making them unsuitable for low Size, Weight, and Power (SWaP) robots. In this work, we propose a novel and computationally efficient framework for detecting and mapping transparent obstacles onboard a sub-300g quadrotor. Our method fuses data from a Time-of-Flight (ToF) camera and an ultrasonic sensor with a custom, lightweight 2D convolution model. This specialized approach accurately detects specular reflections and propagates their depth into corresponding empty regions of the depth map, effectively rendering transparent obstacles visible. The entire pipeline operates in real-time, utilizing only a small fraction of a CPU core on an embedded processor. We validate our system through a series of experiments in both controlled and real-world environments, demonstrating the utility of our method through experiments where the robot maps indoor environments containing glass. Our work is, to our knowledge, the first of its kind to demonstrate a real-time, onboard transparent obstacle mapping system on a low-SWaP quadrotor using only the CPU.
comment: 8 pages, 8 figures, submitted to ICRA 2026
☆ Limited-Angle Tomography Reconstruction via Projector Guided 3D Diffusion
Limited-angle electron tomography aims to reconstruct 3D shapes from 2D projections of Transmission Electron Microscopy (TEM) within a restricted range and number of tilting angles, but it suffers from the missing-wedge problem that causes severe reconstruction artifacts. Deep learning approaches have shown promising results in alleviating these artifacts, yet they typically require large high-quality training datasets with known 3D ground truth which are difficult to obtain in electron microscopy. To address these challenges, we propose TEMDiff, a novel 3D diffusion-based iterative reconstruction framework. Our method is trained on readily available volumetric FIB-SEM data using a simulator that maps them to TEM tilt series, enabling the model to learn realistic structural priors without requiring clean TEM ground truth. By operating directly on 3D volumes, TEMDiff implicitly enforces consistency across slices without the need for additional regularization. On simulated electron tomography datasets with limited angular coverage, TEMDiff outperforms state-of-the-art methods in reconstruction quality. We further demonstrate that a trained TEMDiff model generalizes well to real-world TEM tilts obtained under different conditions and can recover accurate structures from tilt ranges as narrow as 8 degrees, with 2-degree increments, without any retraining or fine-tuning.
comment: 10 pages, 11 figures
☆ LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval
Neural models such as YOLO and HuBERT can be used to detect local properties such as objects ("car") and emotions ("angry") in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., "does the speaker eventually sound happy in this audio clip?"), and ranked retrieval (e.g., "retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected"). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
☆ From Captions to Keyframes: Efficient Video Summarization via Caption- and Context-Aware Frame Scoring
Efficient video-language understanding requires selecting a small set of frames that retain semantic and contextual information from long videos. We propose KeyScore, a multimodal frame scoring framework that jointly leverages captions and visual context to estimate frame-level importance. By combining semantic similarity, temporal diversity, and contextual drop impact, KeyScore identifies the most informative frames for downstream tasks such as retrieval, captioning, and video-language reasoning. To complement KeyScore, we introduce STACFP (Spatio-Temporal Adaptive Clustering for Frame Proposals), which generates compact and diverse frame candidates for long-form videos. Together, these modules achieve up to 99\% frame reduction compared to full-frame inference and substantially outperform standard 8-frame encoders on MSRVTT, MSVD, and DiDeMo. Our results demonstrate that emphasizing multimodal alignment between visual and textual signals enables scalable, efficient, and caption-grounded video understanding -- without explicit video summarization.
comment: 10 pages, 4 figures
☆ Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation
Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Currently, progress is hindered by 1) limited two-person training data, inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose our Text2Interact framework, designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for an agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for another agent, and uses a neural motion evaluator to filter weak or misaligned samples-expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.
☆ Superpixel Integrated Grids for Fast Image Segmentation
Superpixels have long been used in image simplification to enable more efficient data processing and storage. However, despite their computational potential, their irregular spatial distribution has often forced deep learning approaches to rely on specialized training algorithms and architectures, undermining the original motivation for superpixelations. In this work, we introduce a new superpixel-based data structure, SIGRID (Superpixel-Integrated Grid), as an alternative to full-resolution images in segmentation tasks. By leveraging classical shape descriptors, SIGRID encodes both color and shape information of superpixels while substantially reducing input dimensionality. We evaluate SIGRIDs on four benchmark datasets using two popular convolutional segmentation architectures. Our results show that, despite compressing the original data, SIGRIDs not only match but in some cases surpass the performance of pixel-level representations, all while significantly accelerating model training. This demonstrates that SIGRIDs achieve a favorable balance between accuracy and computational efficiency.
☆ Active Next-Best-View Optimization for Risk-Averse Path Planning
Safe navigation in uncertain environments requires planning methods that integrate risk aversion with active perception. In this work, we present a unified framework that refines a coarse reference path by constructing tail-sensitive risk maps from Average Value-at-Risk statistics on an online-updated 3D Gaussian-splat Radiance Field. These maps enable the generation of locally safe and feasible trajectories. In parallel, we formulate Next-Best-View (NBV) selection as an optimization problem on the SE(3) pose manifold, where Riemannian gradient descent maximizes an expected information gain objective to reduce uncertainty most critical for imminent motion. Our approach advances the state-of-the-art by coupling risk-averse path refinement with NBV planning, while introducing scalable gradient decompositions that support efficient online updates in complex environments. We demonstrate the effectiveness of the proposed framework through extensive computational studies.
☆ SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation
We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision -- from coarse 2D or 3D boxes to pixel-level segmentations and depth -- with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed. Code and visualizations at https://oindrilasaha.github.io/SIGMA-Gen/
comment: Webpage: https://oindrilasaha.github.io/SIGMA-Gen/
☆ TDiff: Thermal Plug-And-Play Prior with Patch-Based Diffusion
Thermal images from low-cost cameras often suffer from low resolution, fixed pattern noise, and other localized degradations. Available datasets for thermal imaging are also limited in both size and diversity. To address these challenges, we propose a patch-based diffusion framework (TDiff) that leverages the local nature of these distortions by training on small thermal patches. In this approach, full-resolution images are restored by denoising overlapping patches and blending them using smooth spatial windowing. To our knowledge, this is the first patch-based diffusion framework that models a learned prior for thermal image restoration across multiple tasks. Experiments on denoising, super-resolution, and deblurring demonstrate strong results on both simulated and real thermal data, establishing our method as a unified restoration pipeline.
☆ Road Surface Condition Detection with Machine Learning using New York State Department of Transportation Camera Images and Weather Forecast Data
The New York State Department of Transportation (NYSDOT) has a network of roadside traffic cameras that are used by both the NYSDOT and the public to observe road conditions. The NYSDOT evaluates road conditions by driving on roads and observing live cameras, tasks which are labor-intensive but necessary for making critical operational decisions during winter weather events. However, machine learning models can provide additional support for the NYSDOT by automatically classifying current road conditions across the state. In this study, convolutional neural networks and random forests are trained on camera images and weather data to predict road surface conditions. Models are trained on a hand-labeled dataset of ~22,000 camera images, each classified by human labelers into one of six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed. Model generalizability is prioritized to meet the operational needs of the NYSDOT decision makers, and the weather-related road surface condition model in this study achieves an accuracy of 81.5% on completely unseen cameras.
☆ TransFIRA: Transfer Learning for Face Image Recognizability Assessment
Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder's decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary--aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment -- encoder-specific, accurate, interpretable, and extensible across modalities -- significantly advancing FIQA in accuracy, explainability, and scope.
comment: Project Page: https://transfira.github.io/
☆ Conditional Denoising Diffusion Model-Based Robust MR Image Reconstruction from Highly Undersampled Data
Magnetic Resonance Imaging (MRI) is a critical tool in modern medical diagnostics, yet its prolonged acquisition time remains a critical limitation, especially in time-sensitive clinical scenarios. While undersampling strategies can accelerate image acquisition, they often result in image artifacts and degraded quality. Recent diffusion models have shown promise for reconstructing high-fidelity images from undersampled data by learning powerful image priors; however, most existing approaches either (i) rely on unsupervised score functions without paired supervision or (ii) apply data consistency only as a post-processing step. In this work, we introduce a conditional denoising diffusion framework with iterative data-consistency correction, which differs from prior methods by embedding the measurement model directly into every reverse diffusion step and training the model on paired undersampled-ground truth data. This hybrid design bridges generative flexibility with explicit enforcement of MRI physics. Experiments on the fastMRI dataset demonstrate that our framework consistently outperforms recent state-of-the-art deep learning and diffusion-based methods in SSIM, PSNR, and LPIPS, with LPIPS capturing perceptual improvements more faithfully. These results demonstrate that integrating conditional supervision with iterative consistency updates yields substantial improvements in both pixel-level fidelity and perceptual realism, establishing a principled and practical advance toward robust, accelerated MRI reconstruction.
☆ Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
comment: 33 pages, 13 figures, 10 tables
☆ Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping
Forest structural complexity metrics integrate multiple canopy attributes into a single value that reflects habitat quality and ecosystem function. Spaceborne lidar from the Global Ecosystem Dynamics Investigation (GEDI) has enabled mapping of structural complexity in temperate and tropical forests, but its sparse sampling limits continuous high-resolution mapping. We present a scalable, deep learning framework fusing GEDI observations with multimodal Synthetic Aperture Radar (SAR) datasets to produce global, high-resolution (25 m) wall-to-wall maps of forest structural complexity. Our adapted EfficientNetV2 architecture, trained on over 130 million GEDI footprints, achieves high performance (global R2 = 0.82) with fewer than 400,000 parameters, making it an accessible tool that enables researchers to process datasets at any scale without requiring specialized computing infrastructure. The model produces accurate predictions with calibrated uncertainty estimates across biomes and time periods, preserving fine-scale spatial patterns. It has been used to generate a global, multi-temporal dataset of forest structural complexity from 2015 to 2022. Through transfer learning, this framework can be extended to predict additional forest structural variables with minimal computational cost. This approach supports continuous, multi-temporal monitoring of global forest structural dynamics and provides tools for biodiversity conservation and ecosystem management efforts in a changing climate.
☆ RGBD Gaze Tracking Using Transformer for Feature Fusion
Subject of this thesis is the implementation of an AI-based Gaze Tracking system using RGBD images that contain both color (RGB) and depth (D) information. To fuse the features extracted from the images, a module based on the Transformer architecture is used. The combination of RGBD input images and Transformers was chosen because it has not yet been investigated. Furthermore, a new dataset is created for training the AI models as existing datasets either do not contain depth information or only contain labels for Gaze Point Estimation that are not suitable for the task of Gaze Angle Estimation. Various model configurations are trained, validated and evaluated on a total of three different datasets. The trained models are then to be used in a real-time pipeline to estimate the gaze direction and thus the gaze point of a person in front of a computer screen. The AI model architecture used in this thesis is based on an earlier work by Lian et al. It uses a Generative Adversarial Network (GAN) to simultaneously remove depth map artifacts and extract head pose features. Lian et al. achieve a mean Euclidean error of 38.7mm on their own dataset ShanghaiTechGaze+. In this thesis, a model architecture with a Transformer module for feature fusion achieves a mean Euclidean error of 55.3mm on the same dataset, but we show that using no pre-trained GAN module leads to a mean Euclidean error of 30.1mm. Replacing the Transformer module with a Multilayer Perceptron (MLP) improves the error to 26.9mm. These results are coherent with the ones on the other two datasets. On the ETH-XGaze dataset, the model with Transformer module achieves a mean angular error of 3.59{\deg} and without Transformer module 3.26{\deg}, whereas the fundamentally different model architecture used by the dataset authors Zhang et al. achieves a mean angular error of 2.04{\deg}. On the OTH-Gaze-Estimation dataset created for...
comment: Master Thesis with 125 pages, 59 figures, 17 tables
Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling
High-resolution (4K) image-to-image synthesis has become increasingly important for mobile applications. Existing diffusion models for image editing face significant challenges, in terms of memory and image quality, when deployed on resource-constrained devices. In this paper, we present MobilePicasso, a novel system that enables efficient image editing at high resolutions, while minimising computational cost and memory usage. MobilePicasso comprises three stages: (i) performing image editing at a standard resolution with hallucination-aware loss, (ii) applying latent projection to overcome going to the pixel space, and (iii) upscaling the edited image latent to a higher resolution with adaptive context-preserving tiling. Our user study with 46 participants reveals that MobilePicasso not only improves image quality by 18-48% but reduces hallucinations by 14-51% over existing methods. MobilePicasso demonstrates significantly lower latency, e.g., up to 55.8$\times$ speed-up, yet with a small increase in runtime memory, e.g., a mere 9% increase over prior work. Surprisingly, the on-device runtime of MobilePicasso is observed to be faster than a server-based high-resolution image editing model running on an A100 GPU.
comment: Preprint. Under review
☆ ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations
While Large Vision-Language Models (LVLMs) achieve strong performance in multimodal tasks, hallucinations continue to hinder their reliability. Among the three categories of hallucinations, which include object, attribute, and relation, relation hallucinations account for the largest proportion but have received the least attention. To address this issue, we propose ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are sequentially input to the model, with textual and visual memories from earlier steps providing supporting context for subsequent ones, thereby forming an interleaved chain of images and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.
♻ ☆ MoME: Estimating Psychological Traits from Gait with Multi-Stage Mixture of Movement Experts
Gait encodes rich biometric and behavioural information, yet leveraging the manner of walking to infer psychological traits remains a challenging and underexplored problem. We introduce a hierarchical Multi-Stage Mixture of Movement Experts (MoME) architecture for multi-task prediction of psychological attributes from gait sequences represented as 2D poses. MoME processes the walking cycle in four stages of movement complexity, employing lightweight expert models to extract spatio-temporal features and task-specific gating modules to adaptively weight experts across traits and stages. Evaluated on the PsyMo benchmark covering 17 psychological traits, our method outperforms state-of-the-art gait analysis models, achieving a 37.47% weighted F1 score at the run level and 44.6% at the subject level. Our experiments show that integrating auxiliary tasks such as identity recognition, gender prediction, and BMI estimation further improves psychological trait estimation. Our findings demonstrate the viability of multi-task gait-based learning for psychological trait estimation and provide a foundation for future research on movement-informed psychological inference.
comment: 4 Figures, 4 Tables
♻ ☆ ExGS: Extreme 3D Gaussian Compression with Diffusion Priors
Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at: https://github.com/chenttt2001/ExGS
♻ ☆ Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning
We present \emph{Video-in-the-Loop} (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first \emph{localizing} question-relevant interval(s) with a low-fps skim and then \emph{answering} via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce \dataname{}, which converts description based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97\%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8\% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks.
comment: Technical Report
♻ ☆ Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising
Building on recent advances in Bayesian statistics and image denoising, we propose Noise2Score3D, a fully unsupervised framework for point cloud denoising. Noise2Score3D learns the score function of the underlying point cloud distribution directly from noisy data, eliminating the need for clean data during training. Using Tweedie's formula, our method performs denoising in a single step, avoiding the iterative processes used in existing unsupervised methods, thus improving both accuracy and efficiency. Additionally, we introduce Total Variation for Point Clouds as a denoising quality metric, which allows for the estimation of unknown noise parameters. Experimental results demonstrate that Noise2Score3D achieves state-of-the-art performance on standard benchmarks among unsupervised learning methods in Chamfer distance and point-to-mesh metrics. Noise2Score3D also demonstrates strong generalization ability beyond training datasets. Our method, by addressing the generalization issue and challenge of the absence of clean data in learning-based methods, paves the way for learning-based point cloud denoising methods in real-world applications.
comment: arXiv admin note: substantial text overlap with arXiv:2502.16826
♻ ☆ Electromagnetic Inverse Scattering from a Single Transmitter
Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging, especially under sparse transmitter setups, e.g., with only one transmitter. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires time-consuming case-specific optimization and fails under sparse transmitter setups. To address these limitations, we revisit EISP from a data-driven perspective. The scarcity of transmitters leads to an insufficient amount of measured data, which fails to capture adequate physical information for stable inversion. Built on this insight, we propose a fully end-to-end and data-driven framework that predicts the relative permittivity of scatterers from measured fields, leveraging data distribution priors to compensate for the lack of physical information. This design enables data-driven training and feed-forward prediction of relative permittivity while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy and robustness. Notably, it achieves high-quality results even with a single transmitter, a setting where previous methods consistently fail. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.
♻ ☆ Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at \href{https://github.com/svg-project/Sparse-VideoGen}{https://github.com/svg-project/Sparse-VideoGen}.
♻ ☆ OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
comment: technical report, project url:https://onecat-ai.github.io/
♻ ☆ Trajectory Prediction Meets Large Language Models: A Survey
Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.
comment: 16 pages, GitHub: https://github.com/colorfulfuture/Awesome-Trajectory-Motion-Prediction-Papers
♻ ☆ HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping
Novel View Synthesis (NVS) from sparse views presents a formidable challenge in 3D reconstruction, where limited multi-view constraints lead to severe overfitting, geometric distortion, and fragmented scenes. While 3D Gaussian Splatting (3DGS) delivers real-time, high-fidelity rendering, its performance drastically deteriorates under sparse inputs, plagued by floating artifacts and structural failures. To address these challenges, we introduce HBSplat, a unified framework that elevates 3DGS by seamlessly integrating robust structural cues, virtual view constraints, and occluded region completion. Our core contributions are threefold: a Hybrid-Loss Depth Estimation module that ensures multi-view consistency by leveraging dense matching priors and integrating reprojection, point propagation, and smoothness constraints; a Bidirectional Warping Virtual View Synthesis method that enforces substantially stronger constraints by creating high-fidelity virtual views through bidirectional depth-image warping and multi-view fusion; and an Occlusion-Aware Reconstruction component that recovers occluded areas using a depth-difference mask and a learning-based inpainting model. Extensive evaluations on LLFF, Blender, and DTU benchmarks validate that HBSplat sets a new state-of-the-art, achieving up to 21.13 dB PSNR and 0.189 LPIPS, while maintaining real-time inference. Code is available at: https://github.com/eternalland/HBSplat.
comment: 14 pages, 21 figures
♻ ☆ Unified Cross-Modal Medical Image Synthesis with Hierarchical Mixture of Product-of-Experts
We propose a deep mixture of multimodal hierarchical variational auto-encoders called MMHVAE that synthesizes missing images from observed images in different modalities. MMHVAE's design focuses on tackling four challenges: (i) creating a complex latent representation of multimodal data to generate high-resolution images; (ii) encouraging the variational distributions to estimate the missing information needed for cross-modal image synthesis; (iii) learning to fuse multimodal information in the context of missing data; (iv) leveraging dataset-level information to handle incomplete data sets at training time. Extensive experiments are performed on the challenging problem of pre-operative brain multi-parametric magnetic resonance and intra-operative ultrasound imaging.
comment: Accepted in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
♻ ☆ Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc, inside a theory-driven framework which we name the \textbf{\emph{Uni-Instruct}}. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of \textbf{\emph{1.46}} for unconditional generation and \textbf{\emph{1.38}} for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of \textbf{\emph{1.02}}, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.
♻ ☆ MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling
Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence. To conquer these challenges, we propose MoSA, which decouples the process of human video generation into two components, i.e., structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human motion sequence from the text prompt. The remaining video appearance is then synthesized under the guidance of this structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic Control modules with a dense tracking constraint during training. The modeling of human-environment interactions is improved through the proposed contact constraint. Those two components work comprehensively to ensure the structural and appearance fidelity across the generated videos. This paper also contributes a large-scale human video dataset, which features more complex and diverse motions than existing human video datasets. We conduct comprehensive comparisons between MoSA and a variety of approaches, including general video generation models, human video generation models, and human animation models. Experiments demonstrate that MoSA substantially outperforms existing approaches across the majority of evaluation metrics.
comment: Project: https://hywang2002.github.io/MoSA
♻ ☆ Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search NeurIPS2025
The remarkable progress in text-to-video diffusion models enables the generation of photorealistic videos, although the content of these generated videos often includes unnatural movement or deformation, reverse playback, and motionless scenes. Recently, an alignment problem has attracted huge attention, where we steer the output of diffusion models based on some measure of the content's goodness. Because there is a large room for improvement of perceptual quality along the frame direction, we should address which metrics we should optimize and how we can optimize them in the video generation. In this paper, we propose diffusion latent beam search with lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality with respect to alignment to prompts requires reward calibration by weighting existing metrics. This is because when humans or vision language models evaluate outputs, many previous metrics to quantify the naturalness of video do not always correlate with the evaluation. We demonstrate that our method improves the perceptual quality evaluated on the calibrated reward, VLMs, and human assessment, without model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling under much more efficient computational cost. The experiments highlight that our method is beneficial to many capable generative models, and provide a practical guideline: we should prioritize the inference-time compute allocation into enabling the lookahead estimator and increasing the search budget, rather than expanding the denoising steps.
comment: Accepted to NeurIPS2025. Website: https://sites.google.com/view/t2v-dlbs and Code: https://github.com/shim0114/T2V-Diffusion-Search
♻ ☆ VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Analysis indicates that the dominant failure mode stems from image-text misalignment: rather than grounding reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and finetuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and catalyst for developing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image-text integration.
♻ ☆ Teaching Metric Distance to Discrete Autoregressive Language Models
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
♻ ☆ v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
When thinking with images, humans rarely rely on a single glance: they revisit visual information repeatedly during reasoning. However, existing models typically process images only once and thereafter generate reasoning entirely in text, lacking mechanisms to re-access or ground inference in visual representations. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. In response, we introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream, ensuring that evolving hypotheses remain grounded in perceptual evidence. Crucially, our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model's reasoning. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Across various multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines, establishing point-and-copy as a practical mechanism for grounded reasoning. The model checkpoint and dataset are available at github.com/jun297/v1.
♻ ☆ A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available at Github.
♻ ☆ AutoEdit: Automatic Hyperparameter Tuning for Image Editing
Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To get the reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification. This process incurs high computational costs due to the huge hyperparameter search space. We consider searching optimal editing's hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of a diffusion-based image editing framework in the real world. Codes can be found at https://github.com/chaupham1709/AutoEdit.git.
comment: Provided code link
♻ ☆ Low-Rank Tensor Recovery via Variational Schatten-p Quasi-Norm and Jacobian Regularization
Higher-order tensors are well-suited for representing multi-dimensional data, such as images and videos, which typically characterize low-rank structures. Low-rank tensor decomposition has become essential in machine learning and computer vision, but existing methods like Tucker decomposition offer flexibility at the expense of interpretability. The CANDECOMP/PARAFAC (CP) decomposition provides a natural and interpretable structure, while obtaining a sparse solutions remains challenging. Leveraging the rich properties of CP decomposition, we propose a CP-based low-rank tensor function parameterized by neural networks (NN) for implicit neural representation. This approach can model the tensor both on-grid and beyond grid, fully utilizing the non-linearity of NN with theoretical guarantees on excess risk bounds. To achieve sparser CP decomposition, we introduce a variational Schatten-p quasi-norm to prune redundant rank-1 components and prove that it serves as a common upper bound for the Schatten-p quasi-norms of arbitrary unfolding matrices. For smoothness, we propose a regularization term based on the spectral norm of the Jacobian and Hutchinson's trace estimator. The proposed smoothness regularization is SVD-free and avoids explicit chain rule derivations. It can serve as an alternative to Total Variation (TV) regularization in image denoising tasks and is naturally applicable to implicit neural representation. Extensive experiments on multi-dimensional data recovery tasks, including image inpainting, denoising, and point cloud upsampling, demonstrate the superiority and versatility of our method compared to state-of-the-art approaches. The code is available at https://github.com/CZY-Code/CP-Pruner.
♻ ☆ ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NeurIPS 2025
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
comment: Accepted at NeurIPS 2025 (oral)
♻ ☆ RimSet: Quantitatively Identifying and Characterizing Chronic Active Multiple Sclerosis Lesion on Quantitative Susceptibility Maps
Background: Rim+ lesions in multiple sclerosis (MS), detectable via Quantitative Susceptibility Mapping (QSM), correlate with increased disability. Existing literature lacks quantitative analysis of these lesions. We introduce RimSet for quantitative identification and characterization of rim+ lesions on QSM. Methods: RimSet combines RimSeg, an unsupervised segmentation method using level-set methodology, and radiomic measurements with Local Binary Pattern texture descriptors. We validated RimSet using simulated QSM images and an in vivo dataset of 172 MS subjects with 177 rim+ and 3986 rim-lesions. Results: RimSeg achieved a 78.7% Dice score against the ground truth, with challenges in partial rim lesions. RimSet detected rim+ lesions with a partial ROC AUC of 0.808 and PR AUC of 0.737, surpassing existing methods. QSMRim-Net showed the lowest mean square error (0.85) and high correlation (0.91; 95% CI: 0.88, 0.93) with expert annotations at the subject level.
comment: 13 pages, 7 figures, 4 tables
♻ ☆ Enhancing Fitness Movement Recognition with Attention Mechanism and Pre-Trained Feature Extractors
Fitness movement recognition, a focused subdomain of human activity recognition (HAR), plays a vital role in health monitoring, rehabilitation, and personalized fitness training by enabling automated exercise classification from video data. However, many existing deep learning approaches rely on computationally intensive 3D models, limiting their feasibility in real-time or resource-constrained settings. In this paper, we present a lightweight and effective framework that integrates pre-trained 2D Convolutional Neural Networks (CNNs) such as ResNet50, EfficientNet, and Vision Transformers (ViT) with a Long Short-Term Memory (LSTM) network enhanced by spatial attention. These models efficiently extract spatial features while the LSTM captures temporal dependencies, and the attention mechanism emphasizes informative segments. We evaluate the framework on a curated subset of the UCF101 dataset, achieving a peak accuracy of 93.34\% with the ResNet50-based configuration. Comparative results demonstrate the superiority of our approach over several state-of-the-art HAR systems. The proposed method offers a scalable and real-time-capable solution for fitness activity recognition with broader applications in vision-based health and activity monitoring.
comment: 6 pages,9 figures, 2025 28th International Conference on Computer and Information Technology (ICCIT)
♻ ☆ A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)
Background: Apparent Diffusion Coefficient (ADC) values and Total Diffusion Volume (TDV) from Whole-body diffusion-weighted MRI (WB-DWI) are recognized cancer imaging biomarkers. However, manual disease delineation for ADC and TDV measurements is unfeasible in clinical practice, demanding automation. As a first step, we propose an algorithm to generate fast and reproducible probability maps of the skeleton, adjacent internal organs (liver, spleen, urinary bladder, and kidneys), and spinal canal. Methods: We developed an automated deep-learning pipeline based on a 3D patch-based Residual U-Net architecture that localises and delineates these anatomical structures on WB-DWI. The algorithm was trained using "soft labels" (non-binary segmentations) derived from a computationally intensive atlas-based approach. For training and validation, we employed a multi-centre WB-DWI dataset comprising 532 scans from patients with Advanced Prostate Cancer (APC) or Multiple Myeloma (MM), with testing on 45 patients. Results: Our weakly-supervised deep learning model achieved an average dice score of 0.67 for whole skeletal delineation, 0.76 when excluding ribcage, 0.83 for internal organs, and 0.86 for spinal canal, with average surface distances below 3mm. Relative median ADC differences between automated and manual full-body delineations were below 10%. The model was 12x faster than the atlas-based registration algorithm (25 sec vs. 5 min). Two experienced radiologists rated the model's outputs as either "good" or "excellent" on test scans, with inter-reader agreement from fair to substantial (Gwet's AC1 = 0.27-0.72). Conclusion: The model offers fast, reproducible probability maps for localising and delineating body regions on WB-DWI, potentially enabling non-invasive imaging biomarker quantification to support disease staging and treatment response assessment.
♻ ☆ When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding NeurIPS 2025
Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of 1,740 samples spanning both semantic and non-semantic cases, with manually curated question answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
comment: Accepted by NeurIPS 2025
♻ ☆ A discussion about violin reduction: geometric analysis of contour lines and channel of minima
Some early violins have been reduced during their history to fit imposed morphological standards, while more recent ones have been built directly to these standards. We can observe differences between reduced and unreduced instruments, particularly in their contour lines and channel of minima. In a recent preliminary work, we computed and highlighted those two features for two instruments using triangular 3D meshes acquired by photogrammetry, whose fidelity has been assessed and validated with sub-millimetre accuracy. We propose here an extension to a corpus of 38 violins, violas and cellos, and introduce improved procedures, leading to a stronger discussion of the geometric analysis. We first recall the material we are working with. We then discuss how to derive the best reference plane for the violin alignment, which is crucial for the computation of contour lines and channel of minima. Finally, we show how to compute efficiently both characteristics and we illustrate our results with a few examples.
comment: Paper accepted for the Florence Heri-Tech 2024 Conference
♻ ☆ AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection
Omni-domain infrared small target detection (Omni-IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multimodal framework that is the first to incorporate metadata into the IRSTD paradigm for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at https://github.com/GrokCV/AuxDet.
♻ ☆ Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness
Diffusion models have achieved unprecedented success in image generation but pose increasing risks in terms of privacy, fairness, and security. A growing demand exists to \emph{erase} sensitive or harmful concepts (e.g., NSFW content, private individuals, artistic styles) from these models while preserving their overall generative capabilities. We introduce \textbf{SCORE} (Secure and Concept-Oriented Robust Erasure), a novel framework for robust concept removal in diffusion models. SCORE formulates concept erasure as an \emph{adversarial independence} problem, theoretically guaranteeing that the model's outputs become statistically independent of the erased concept. Unlike prior heuristic methods, SCORE minimizes the mutual information between a target concept and generated outputs, yielding provable erasure guarantees. We provide formal proofs establishing convergence properties and derive upper bounds on residual concept leakage. Empirically, we evaluate SCORE on Stable Diffusion and FLUX across four challenging benchmarks: object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning. SCORE consistently outperforms state-of-the-art methods including EraseAnything, ANT, MACE, ESD, and UCE, achieving up to \textbf{12.5\%} higher erasure efficacy while maintaining comparable or superior image quality. By integrating adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models.
comment: updated version
♻ ☆ Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.
comment: Killian Steunou is the main contributor and corresponding author of this work
♻ ☆ Incremental Object Detection with Prompt-based Methods ICCV
Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.
comment: Accepted to ICCV Workshops 2025: v2 update affiliation
♻ ☆ RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection ICCV
Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models' inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.
comment: Accepted to ICCV Workshops 2025; v2: add GitHub link and update affiliation
♻ ☆ Bridging Semantic Logic Gaps: A Cognition Inspired Multimodal Boundary Preserving Network for Image Manipulation Localization
The existing image manipulation localization (IML) models mainly relies on visual cues, but ignores the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition inspired multimodal boundary preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that the erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models. Our code is available on https://github.com/vpsg-research/CMB-Net.
♻ ☆ Evaluation of Deformable Image Registration under Alignment-Regularity Trade-of
Evaluating deformable image registration (DIR) is challenging due to the inherent trade-off between achieving high alignment accuracy and maintaining deformation regularity. However, most existing DIR works either address this trade-off inadequately or overlook it altogether. In this paper, we highlight the issues with existing practices and propose an evaluation scheme that captures the trade-off continuously to holistically evaluate DIR methods. We first introduce the alignment regularity characteristic (ARC) curves, which describe the performance of a given registration method as a spectrum under various degrees of regularity. We demonstrate that the ARC curves reveal unique insights that are not evident from existing evaluation practices, using experiments on representative deep learning DIR methods with various network architectures and transformation models. We further adopt a HyperNetwork based approach that learns to continuously interpolate across the full regularization range, accelerating the construction and improving the sample density of ARC curves. Finally, we provide general guidelines for a nuanced model evaluation and selection using our evaluation scheme for both practitioners and registration researchers.
♻ ☆ Background Semantics Matter: Cross-Task Feature Exchange Network for Clustered Infrared Small Target Detection
Infrared small target detection presents significant challenges due to the limited intrinsic features of the target and the overwhelming presence of visually similar background distractors. We contend that background semantics are critical for distinguishing between objects that appear visually similar in this context. To address this challenge, we propose a task, clustered infrared small target detection, and introduce DenseSIRST, a benchmark dataset that provides per-pixel semantic annotations for background regions. This dataset facilitates the shift from sparse to dense target detection. This dataset facilitates the shift from sparse to dense target detection. Building on this resource, we propose the Background-Aware Feature Exchange Network (BAFE-Net), a multi-task architecture that jointly tackles target detection and background semantic segmentation. BAFE-Net incorporates a dynamic cross-task feature hard-exchange mechanism, enabling the effective exchange of target and background semantics between the two tasks. Comprehensive experiments demonstrate that BAFE-Net significantly enhances target detection accuracy while mitigating false alarms. The DenseSIRST dataset, along with the code and trained models, is publicly available at https://github.com/GrokCV/BAFE-Net.
♻ ☆ Towards Unified Image Deblurring using a Mixture-of-Experts Decoder
Image deblurring, removing blurring artifacts from images, is a fundamental task in computational photography and low-level computer vision. Existing approaches focus on specialized solutions tailored to particular blur types, thus, these solutions lack generalization. This limitation in current methods implies requiring multiple models to cover several blur types, which is not practical in many real scenarios. In this paper, we introduce the first all-in-one deblurring method capable of efficiently restoring images affected by diverse blur degradations, including global motion, local motion, blur in low-light conditions, and defocus blur. We propose a mixture-of-experts (MoE) decoding module, which dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration in an end-to-end manner. Our unified approach not only achieves performance comparable to dedicated task-specific models, but also shows promising generalization to unseen blur scenarios, particularly when leveraging appropriate expert selection. Code available at https://github.com/cidautai/DeMoE.
♻ ☆ The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?
Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
♻ ☆ Cat: Post-Training Quantization Error Reduction via Cluster-based Affine Transformation
Post-Training Quantization (PTQ) reduces the memory footprint and computational overhead of deep neural networks by converting full-precision (FP) values into quantized and compressed data types. While PTQ is more cost-efficient than Quantization-Aware Training (QAT), it is highly susceptible to accuracy degradation under a low-bit quantization (LQ) regime (e.g., 2-bit). Affine transformation is a classical technique used to reduce the discrepancy between the information processed by a quantized model and that processed by its full-precision counterpart; however, we find that using plain affine transformation, which applies a uniform affine parameter set for all outputs, worsens the results in low-bit PTQ. To address this, we propose Cluster-based Affine Transformation (CAT), an error-reduction framework that employs cluster-specific parameters to align LQ outputs with FP counterparts. CAT refines LQ outputs with only a negligible number of additional parameters, without requiring fine-tuning of the model or quantization parameters. We further introduce a novel PTQ framework integrated with CAT. Experiments on ImageNet-1K show that this framework consistently outperforms prior PTQ methods across diverse architectures and LQ settings, achieving up to 53.18% Top-1 accuracy on W2A2 ResNet-18. Moreover, CAT enhances existing PTQ baselines by more than 3% when used as a plug-in. We plan to release our implementation alongside the publication of this paper.
comment: 29 pages, 20 figures
♻ ☆ Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study
SkinGPT-4, a large vision-language model, leverages annotated skin disease images to augment clinical workflows in underserved communities. However, its training dataset predominantly represents lighter skin tones, limiting diagnostic accuracy for darker tones. Here, we evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases, including eczema, allergic-contact dermatitis, and psoriasis using the open-sourced SCIN dataset. We leveraged the SkinGPT-4 backbone to develop finetuned models for custom skin disease classification tasks and explored bias mitigation strategies. Clinical evaluation by board-certified dermatologists on six relevant skin diseases from 300 SCIN cases assessed images for diagnostic accuracy, informativity, physician utility, and patient utility. Model fairness metrics, including demographic parity and equalized odds, were calculated across skin tones. SkinGPT-4 achieved an average demographic parity of 0.10 across Fitzpatrick types, with notable differences of 0.10-0.15 between lightest and darkest tones across evaluation metrics. Model hallucinations in artifacts and anatomy occurred at a rate of 17.8. Our customized models achieved average F1, precision, and AUROC of 0.75, 0.78, and 0.78 across visually similar disease pairs. Fairness analysis showed an average demographic parity of 0.75, with a maximum disparity of 0.21 across skin tones. The best model achieved parity scores of 0.83, 0.83, 0.76, 0.89, 0.90, and 0.90 for Fitzpatrick I-VI, indicating robust fairness. Large language models such as SkinGPT-4 showed weaker performance on darker tones. Model biases exist across evaluation criteria, and hallucinations may affect diagnostic efficacy. These findings demonstrate the efficacy of training accurate, fair models using existing backbones for custom skin disease classification.
comment: Accepted to EADV (European Academy of Dermatology) and SID (Society for Investigative Dermatology)
♻ ☆ Imagining the Unseen: Generative Location Modeling for Object Placement ICCV 2025
Location modeling, or determining where non-existing objects could feasibly appear in a scene, has the potential to benefit numerous computer vision tasks, from automatic object insertion to scene creation in virtual reality. Yet, this capability remains largely unexplored to date. In this paper, we develop a generative location model that, given an object class and an image, learns to predict plausible bounding boxes for such an object. Our approach first tokenizes the image and target object class, then decodes bounding box coordinates through an autoregressive transformer. This formulation effectively addresses two core challenges in locatio modeling: the inherent one-to-many nature of plausible locations, and the sparsity of existing location modeling datasets, where fewer than 1% of valid placements are labeled. Furthermore, we incorporate Direct Preference Optimization to leverage negative labels, refining the spatial predictions. Empirical evaluations reveal that our generative location model achieves superior placement accuracy on the OPA dataset as compared to discriminative baselines and image composition approaches. We further test our model in the context of object insertion, where it proposes locations for an off-the-shelf inpainting model to render objects. In this respect, our proposal exhibits improved visual coherence relative to state-of-the-art instruction-tuned editing methods, demonstrating a high-performing location model's utility in a downstream application.
comment: Accepted by ICCV 2025 DRL4Real Workshop
♻ ☆ Submillimeter-Accurate 3D Lumbar Spine Reconstruction from Biplanar X-Ray Images: Incorporating a Multi-Task Network and Landmark-Weighted Loss
To meet the clinical demand for accurate 3D lumbar spine assessment in a weight-bearing position, this study presents a novel, fully automatic framework for high-precision 3D reconstruction from biplanar X-ray images, overcoming the limitations of existing methods. The core of this method involves a novel multi-task deep learning network that simultaneously performs lumbar decomposition and landmark detection on the original biplanar radiographs. The decomposition effectively eliminates interference from surrounding tissues, simplifying subsequent image registration, while the landmark detection provides an initial pose estimation for the Statistical Shape Model (SSM), enhancing the efficiency and robustness of the registration process. Building on this, we introduce a landmark-weighted 2D-3D registration strategy. By assigning higher weights to complex posterior structures like the transverse and spinous processes during optimization, this strategy significantly enhances the reconstruction accuracy of the posterior arch. Our method was validated against a gold standard derived from registering CT segmentations to the biplanar X-rays. It sets a new benchmark by achieving sub-millimeter accuracy and completes the full reconstruction and measurement workflow in under 20 seconds, establishing a state-of-the-art combination of precision and speed. This fast and low-dose pipeline provides a powerful automated tool for diagnosing lumbar conditions such as spondylolisthesis and scoliosis in their functional, weight-bearing state.
comment: 27 pages, 16 figures, 9 tables
♻ ☆ Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments
This is the preprint version of the article published in IEEE Access vol. 13, pp. 146503--146526, 2025, doi:10.1109/ACCESS.2025.3599541. Please cite the published version. Urban air pollution remains a pressing global concern, particularly in densely populated and traffic-intensive metropolitan areas like Delhi, where exposure to harmful pollutants severely impacts public health. Delhi, being one of the most polluted cities globally, experiences chronic air quality issues due to vehicular emissions, industrial activities, and construction dust, which exacerbate its already fragile atmospheric conditions. Traditional pollution mitigation strategies, such as static air purifying installations, often fail to maximize their impact due to suboptimal placement and limited adaptability to dynamic urban environments. This study presents a novel deep reinforcement learning (DRL) framework to optimize the placement of air purification booths to improve the air quality index (AQI) in the city of Delhi. We employ Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm, to iteratively learn and identify high-impact locations based on multiple spatial and environmental factors, including population density, traffic patterns, industrial influence, and green space constraints. Our approach is benchmarked against conventional placement strategies, including random and greedy AQI-based methods, using multi-dimensional performance evaluation metrics such as AQI improvement, spatial coverage, population and traffic impact, and spatial entropy.
comment: This is the preprint version of the article published in IEEE Access vol. 13, pp. 146503--146526, 2025, doi:10.1109/ACCESS.2025.3599541. Please cite the published version
♻ ☆ PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization
Accurate 3D shape representation is essential in engineering applications such as design, optimization, and simulation. In practice, engineering workflows require structured, part-based representations, as objects are inherently designed as assemblies of distinct components. However, most existing methods either model shapes holistically or decompose them without predefined part structures, limiting their applicability in real-world design tasks. We propose PartSDF, a supervised implicit representation framework that explicitly models composite shapes with independent, controllable parts while maintaining shape consistency. Thanks to its simple but innovative architecture, PartSDF outperforms both supervised and unsupervised baselines in reconstruction and generation tasks. We further demonstrate its effectiveness as a structured shape prior for engineering applications, enabling precise control over individual components while preserving overall coherence. Code available at https://github.com/cvlab-epfl/PartSDF.
comment: Accepted to TMLR (27 pages, 14 figures)
♻ ☆ VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts and underrepresent structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a practical and principled path that energizes further advances in vision-language retrieval. Our code and the Visual-RAG-ME benchmark will be publicly released.
♻ ☆ Think Before You Diffuse: Infusing Physical Rules into Video Diffusion
Recent video diffusion models have demonstrated their great capability in generating visually-pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduce great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to infer rich physical context from the text prompt. To incorporate this context into the video diffusion model, we use a multimodal large language model (MLLM) to verify intermediate latent variables against the inferred physical rules, guiding the gradient updates of model accordingly. Textual output of LLM is transformed into continuous signals. We then formulate a set of training objectives that jointly ensure physical accuracy and semantic alignment with the input text. Additionally, failure facts of physical phenomena are corrected via attention injection. We also establish a high-quality physical video dataset containing diverse phyiscal actions and events to facilitate effective finetuning. Extensive experiments on public benchmarks demonstrate that DiffPhy is able to produce state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/.
comment: 19 pages, 8 figures
♻ ☆ SAMCIRT: A Simultaneous Reconstruction and Affine Motion Compensation Technique for Four Dimensional Computed Tomography (4DCT)
The majority of the recent iterative approaches in 4DCT not only rely on nested iterations, thereby increasing computational complexity and constraining potential acceleration, but also fail to provide a theoretical proof of convergence for their proposed iterative schemes. On the other hand, the latest MATLAB and Python image processing toolboxes lack the implementation of analytic adjoints of affine motion operators for 3D object volumes, which does not allow gradient methods using exact derivatives towards affine motion parameters. In this work, we propose the Simultaneous Affine Motion-Compensated Image Reconstruction Technique (SAMCIRT)- an efficient iterative reconstruction scheme that combines image reconstruction and affine motion estimation in a single update step, based on the analytic adjoints of the motion operators then exact partial derivatives with respect to both the reconstruction and the affine motion parameters. Moreover, we prove the separated Lipschitz continuity of the objective function and its associated functions, including the gradient, which supports the convergence of our proposed iterative scheme, despite the non-convexity of the objective function with respect to the affine motion parameters. Results from simulation and real experiments show that our method outperforms the state-of-the-art CT reconstruction with affine motion correction methods in computational feasibility and projection distance. In particular, this allows accurate reconstruction for a real, nonstationary diamond, showing a novel application of 4DCT.
comment: 25 pages, revised version submitted to the SIAM Journal on Imaging Sciences (SIIMS)
♻ ☆ Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines.
♻ ☆ LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders ICCV 2025
In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. To achieve this, we leverage advanced off-the-shelf multimodal encoders to extract representations from short segments within the long video, followed by pre-training a masked-embedding autoencoder capturing high-level interactions across segments. LV-MAE is highly efficient to train and enables the processing of much longer videos by alleviating the constraint on the number of input frames. Furthermore, unlike existing methods that typically pre-train on short-video datasets, our approach offers self-supervised pre-training using long video samples (e.g., 20+ minutes video clips) at scale. Using LV-MAE representations, we achieve state-of-the-art results on three long-video benchmarks -- LVU, COIN, and Breakfast -- employing only a simple classification head for either attentive or linear probing. Finally, to assess LV-MAE pre-training and visualize its reconstruction quality, we leverage the video-language aligned space of short video representations to monitor LV-MAE through video-text retrieval. Code is available at https://github.com/amazon-science/lv-mae.
comment: Accepted to the International Conference on Computer Vision, ICCV 2025
♻ ☆ MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image Synthesis
Deep learning algorithms require extensive data to achieve robust performance. However, data availability is often restricted in the medical domain due to patient privacy concerns. Synthetic data presents a possible solution to these challenges. Recently, image generative models have found increasing use for medical applications but are often designed for singular medical specialties and imaging modalities, thus limiting their broader utility. To address this, we introduce MediSyn: a text-guided, latent diffusion model capable of generating synthetic images from 6 medical specialties and 10 image types. Through extensive experimentation, we first demonstrate that MediSyn quantitatively matches or surpasses the performance of specialist models. Second, we show that our synthetic images are realistic and exhibit strong alignment with their corresponding text prompts, as validated by a team of expert physicians. Third, we provide empirical evidence that our synthetic images are visually distinct from their corresponding real patient images. Finally, we demonstrate that in data-limited settings, classifiers trained solely on synthetic data or real data supplemented with synthetic data can outperform those trained solely on real data. Our findings highlight the immense potential of generalist image generative models to accelerate algorithmic research and development in medicine.
♻ ☆ Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting NeurIPS 2025
3D reconstruction from in-the-wild images remains a challenging task due to inconsistent lighting conditions and transient distractors. Existing methods typically rely on heuristic strategies to handle the low-quality training data, which often struggle to produce stable and consistent reconstructions, frequently resulting in visual artifacts.In this work, we propose \modelname{}, a novel framework that leverages the stochastic nature of these artifacts: they tend to vary across different training runs due to minor randomness. Specifically, our method trains two 3D Gaussian Splatting (3DGS) models in parallel, enforcing a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To prevent the two models from collapsing into similar failure modes due to confirmation bias, we introduce a divergent masking strategy that applies two complementary masks: a multi-cue adaptive mask and a self-supervised soft mask, which leads to an asymmetric training process of the two models, reducing shared error modes. In addition, to improve the efficiency of model training, we introduce a lightweight variant called Dynamic EMA Proxy, which replaces one of the two models with a dynamically updated Exponential Moving Average (EMA) proxy, and employs an alternating masking strategy to preserve divergence. Extensive experiments on challenging real-world datasets demonstrate that our method consistently outperforms existing approaches while achieving high efficiency. See the project website at https://steveli88.github.io/AsymGS.
comment: NeurIPS 2025 Spotlight; Project page: https://steveli88.github.io/AsymGS/
♻ ☆ Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning-the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For solving the classification task, we propose the Chain of Counterfactual Thought framework, utilizing counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset, with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting our proposed method in enhancing dynamic reasoning capabilities of VLMMs.
♻ ☆ Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering ICCV 2025
While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
comment: Accepted by ICCV 2025
♻ ☆ Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization
Multimodal Large Language Models (MLLMs) excel in vision-language reasoning but often struggle with structured perception tasks requiring precise localization and robustness. We propose a reinforcement learning framework that augments Group Relative Policy Optimization (GRPO) with curriculum-based data scheduling and difficulty-aware filtering. This approach stabilizes optimization under sparse, noisy rewards and enables progressive adaptation to complex samples. Evaluations on autonomous driving benchmarks demonstrate substantial improvements in detection accuracy and robustness. Ablation studies confirm the importance of reward design, KL regularization, and curriculum pacing for convergence stability and generalization. Our findings highlight reinforcement-driven optimization with structured data curricula as a scalable path toward robust and interpretable multimodal detection.
♻ ☆ ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge
Edge computing enables data processing closer to the source, significantly reducing latency, an essential requirement for real-time vision-based analytics such as object detection in surveillance and smart city environments. However, these tasks place substantial demands on resource-constrained edge devices, making the joint optimization of energy consumption and detection accuracy critical. To address this challenge, we propose ECORE, a framework that integrates multiple dynamic routing strategies, including a novel estimation-based techniques and an innovative greedy selection algorithm, to direct image processing requests to the most suitable edge device-model pair. ECORE dynamically balances energy efficiency and detection performance based on object characteristics. We evaluate our framework through extensive experiments on real-world datasets, comparing against widely used baseline techniques. The evaluation leverages established object detection models (YOLO, SSD, EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed context-aware routing strategies can reduce energy consumption and latency by 35% and 49%, respectively, while incurring only a 2% loss in detection accuracy compared to accuracy-centric methods.
♻ ☆ High-pass filtered fidelity-imposed network edit (HP-FINE) for robust quantitative susceptibility mapping from high-pass filtered phase
Purpose: To improve the generalization ability of deep learning based predictions of quantitative susceptibility mapping (QSM) from high-pass filtered phase (HPFP) data. Methods: A network fine-tuning step called HP-FINE is proposed, which is based on the high-pass filtering forward model with low-frequency preservation regularization. Several comparisons were conducted: 1. HP-FINE with and without low-frequency regularization, 2. three 3D network architectures (Unet, Progressive Unet, and Big Unet), 3. two types of network output (recovered field and susceptibility), and 4. pre-training with and without the filtering augmentation. HPFP datasets with diverse high-pass filters, another acquisition voxel size, and prospective acquisition were used to assess the accuracy of QSM predictions. In the retrospective datasets, quantitative metrics (PSNR, SSIM, RMSE and HFEN) were used for evaluation. In the prospective dataset, statistics of ROI linear regression and Bland-Altman analysis were used for evaluation. Results: In the retrospective datasets, adding low-frequency regularization in HP-FINE substantially improved prediction accuracy compared to the pre-trained results, especially when combined with the filtering augmentation and recovered field output. In the prospective datasets, HP-FINE with low-frequency regularization and recovered field output demonstrated the preservation of ROI values, a result that was not achieved when using susceptibility as the output. Furthermore, Progressive Unet pre-trained with a combination of multiple losses outperformed both Unet and Progressive Unet pre-trained with a single loss in terms of preserving ROI values.
♻ ☆ Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned
Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models. While PRMs have been extensively studied in text-based domains, their extension to Vision Language Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on Monte Carlo Tree Search (MCTS) for data construction, which can often produce noisy supervision signals and limit generalization across tasks. In this work, we aim to elucidate the design space of VL-PRMs by exploring diverse strategies for dataset construction, training, and test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with judgments from a strong VLM, producing more accurate step-level labels. Second, we propose perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments covering five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision) reveal several key insights: (i) VL-PRMs when used as Outcome Reward Models (ORMs) during test-time scaling (TTS) can outperform VL-PRM guided process step selection, (ii) smaller VL-PRMs can match or even surpass larger ones in detecting process errors, (iii) VL-PRMs uncover latent reasoning abilities in stronger VLM backbones, (iv) perception-level supervision leads to significant gains in test-time scaling, and (v) TTS performance of different policies improve on advanced math reasoning datasets despite not training VL-PRMs on such datasets. We hope our work will motivate further research and support the advancement of VLMs.
♻ ☆ GeoRemover: Removing Objects and Their Causal Visual Artifacts NeurIPS 2025
Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object's geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at https://github.com/buxiangzhiren/GeoRemover.
comment: Accepted as Spotlight at NeurIPS 2025
♻ ☆ Self-Supervised Representation Learning with Joint Embedding Predictive Architecture for Automotive LiDAR Object Detection
Recently, self-supervised representation learning relying on vast amounts of unlabeled data has been explored as a pre-training method for autonomous driving. However, directly applying popular contrastive or generative methods to this problem is insufficient and may even lead to negative transfer. In this paper, we present AD-L-JEPA, a novel self-supervised pre-training framework with a joint embedding predictive architecture (JEPA) for automotive LiDAR object detection. Unlike existing methods, AD-L-JEPA is neither generative nor contrastive. Instead of explicitly generating masked regions, our method predicts Bird's-Eye-View embeddings to capture the diverse nature of driving scenes. Furthermore, our approach eliminates the need to manually form contrastive pairs by employing explicit variance regularization to avoid representation collapse. Experimental results demonstrate consistent improvements on the LiDAR 3D object detection downstream task across the KITTI3D, Waymo, and ONCE datasets, while reducing GPU hours by 1.9x-2.7x and GPU memory by 2.8x-4x compared with the state-of-the-art method Occupancy-MAE. Notably, on the largest ONCE dataset, pre-training on 100K frames yields a 1.61 mAP gain, better than all other methods pre-trained on either 100K or 500K frames, and pre-training on 500K frames yields a 2.98 mAP gain, better than all other methods pre-trained on either 500K or 1M frames. AD-L-JEPA constitutes the first JEPA-based pre-training method for autonomous driving. It offers better quality, faster, and more GPU-memory-efficient self-supervised representation learning. The source code of AD-L-JEPA is ready to be released.
♻ ☆ HiMat: DiT-based Ultra-High Resolution SVBRDF Generation
Creating ultra-high-resolution spatially varying bidirectional reflectance functions (SVBRDFs) is critical for photorealistic 3D content creation, to faithfully represent fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE, and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.
♻ ☆ Integrating Feature Selection and Machine Learning for Nitrogen Assessment in Grapevine Leaves using In-Field Hyperspectral Imaging
Nitrogen (N) is one of the most crucial nutrients in vineyards, affecting plant growth and subsequent products such as wine and juice. Because soil N has high spatial and temporal variability, it is desirable to accurately estimate the N concentration of grapevine leaves and manage fertilization at the individual plant level to optimally meet plant needs. In this study, we used in-field hyperspectral images with wavelengths ranging from $400 to 1000nm of four different grapevine cultivars collected from distinct vineyards and over two growth stages during two growing seasons to develop models for predicting N concentration at the leaf-level and canopy-level. After image processing, two feature selection methods were employed to identify the optimal set of spectral bands that were responsive to leaf N concentrations. The selected spectral bands were used to train and test two different Machine Learning (ML) models, Gradient Boosting and XGBoost, for predicting nitrogen concentrations. The comparison of selected bands for both leaf-level and canopy-level datasets showed that most of the spectral regions identified by the feature selection methods were across both methods and the dataset types (leaf- and canopy-level datasets), particularly in the key regions, 500-525nm, 650-690nm, 750-800nm, and 900-950nm. These findings indicated the robustness of these spectral regions for predicting nitrogen content. The results for N prediction demonstrated that the ML model achieved an R square of 0.49 for canopy-level data and an R square of 0.57 for leaf-level data, despite using different sets of selected spectral bands for each analysis level. The study demonstrated the potential of using in-field hyperspectral imaging and the use of spectral data in integrated feature selection and ML techniques to monitor N status in vineyards.
comment: Major Revision
♻ ☆ DiffCom: Decoupled Sparse Priors Guided Diffusion Compression for Point Clouds
Lossy compression relies on an autoencoder to transform a point cloud into latent points for storage, leaving the inherent redundancy of latent representations unexplored. To reduce redundancy in latent points, we propose a diffusion-based framework guided by sparse priors that achieves high reconstruction quality, especially at low bitrates. Our approach features an efficient dual-density data flow that relaxes size constraints on latent points. It hybridizes a probabilistic conditional diffusion model to encapsulate essential details for reconstruction within sparse priors, which are decoupled hierarchically into intra- and inter-point priors. Specifically, our DiffCom encodes the original point cloud into latent points and decoupled sparse priors through separate encoders. To dynamically attend to geometric and semantic cues from the priors at each encoding and decoding layer, we employ an attention-guided latent denoiser conditioned on the decoupled priors. Additionally, we integrate the local distribution into the arithmetic encoder and decoder to enhance local context modeling of the sparse points. The original point cloud is reconstructed through a point decoder. Compared to state-of-the-art methods, our approach achieves a superior rate-distortion trade-off, as evidenced by extensive evaluations on the ShapeNet dataset and standard test datasets from the MPEG PCC Group.
♻ ☆ SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple spatial capabilities, even for handling simple and normal tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150+ hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs.
♻ ☆ SBP-YOLO:A Lightweight Real-Time Model for Detecting Speed Bumps and Potholes toward Intelligent Vehicle Suspension Systems
Speed bumps and potholes are the most common road anomalies, significantly affecting ride comfort and vehicle stability. Preview-based suspension control mitigates their impact by detecting such irregularities in advance and adjusting suspension parameters proactively. Accurate and real-time detection is essential, but embedded deployment is constrained by limited computational resources and the small size of targets in input images.To address these challenges, this paper proposes SBP-YOLO, an efficient detection framework for speed bumps and potholes in embedded systems. Built upon YOLOv11n, it integrates GhostConv and VoVGSCSPC modules in the backbone and neck to reduce computation while enhancing multi-scale semantic features. A P2-level branch improves small-object detection, and a lightweight and efficient detection head (LEDH) maintains accuracy with minimal overhead. A hybrid training strategy further enhances robustness under varying road and environmental conditions, combining NWD loss, BCKD knowledge distillation, and Albumentations-based augmentation. Experiments show that SBP-YOLO achieves 87.0% mAP, outperforming the YOLOv11n baseline by 5.8%. After TensorRT FP16 quantization, it runs at 139.5 FPS on Jetson AGX Xavier, yielding a 12.4% speedup over the P2-enhanced YOLOv11. These results demonstrate the framework's suitability for fast, low-latency road condition perception in embedded suspension control systems.
comment: 14pages,11figures
♻ ☆ Longitudinal Flow Matching for Trajectory Modeling
Generative models for sequential data often struggle with sparsely sampled and high-dimensional trajectories, typically reducing the learning of dynamics to pairwise transitions. We propose Interpolative Multi-Marginal Flow Matching (IMMFM), a framework that learns continuous stochastic dynamics jointly consistent with multiple observed time points. IMMFM employs a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning. This design captures intrinsic stochasticity, handles irregular sparse sampling, and yields subject-specific trajectories. Experiments on synthetic benchmarks and real-world longitudinal neuroimaging datasets show that IMMFM outperforms existing methods in both forecasting accuracy and further downstream tasks.
♻ ☆ Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification
Mammography, an X-ray-based imaging technique, remains central to the early detection of breast cancer. Recent advances in artificial intelligence have enabled increasingly sophisticated computer-aided diagnostic methods, evolving from patch-based classifiers to whole-image approaches and then to multi-view architectures that jointly analyze complementary projections. Despite this progress, several critical questions remain unanswered. In this study, we systematically investigate these issues by addressing five key research questions: (1) the role of patch classifiers in performance, (2) the transferability of natural-image-trained backbones, (3) the advantages of learn-to-resize over conventional downscaling, (4) the contribution of multi-view integration, and (5) the robustness of findings across varying image quality. Beyond benchmarking, our experiments demonstrate clear performance gains over prior work. For the CBIS-DDSM dataset, we improved single-view AUC from 0.8153 to 0.8343, and multiple-view AUC from 0.8483 to 0.8658. Using a new comparative method, we also observed a 0.0217 AUC increase when extending from single to multiple-view analysis. On the complete VinDr-Mammo dataset, the multiple-view approach further improved results, achieving a 0.0492 AUC increase over single view and reaching 0.8511 AUC overall. These results establish new state-of-the-art benchmarks, providing clear evidence of the advantages of multi-view architectures for mammogram interpretation. Beyond performance, our analysis offers principled insights into model design and transfer learning strategies, contributing to the development of more accurate and reliable breast cancer screening tools. The inference code and trained models are publicly available at https://github.com/dpetrini/multiple-view.
comment: 31 pages
♻ ☆ WAFT: Warping-Alone Field Transforms for Optical Flow
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring, Sintel, and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at https://github.com/princeton-vl/WAFT.
♻ ☆ MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding NeurIPS 2025
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.
comment: Accepted to NeurIPS 2025
♻ ☆ Attacking the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples
Spiking neural networks (SNNs) have drawn much attention for their high energy efficiency and recent advances in classification performance. However, unlike traditional deep learning, the robustness of SNNs to adversarial examples remains underexplored. This work advances the adversarial attack side of SNNs and makes three major contributions. First, we show that successful white-box attacks on SNNs strongly depend on the surrogate gradient estimation technique, even for adversarially trained models. Second, using the best single surrogate gradient estimator, we study the transferability of adversarial examples between SNNs and state-of-the-art architectures such as Vision Transformers (ViTs) and CNNs. Our analysis reveals two major gaps: no existing white-box attack leverages multiple surrogate estimators, and no single attack effectively fools both SNNs and non-SNN models simultaneously. Third, we propose the Mixed Dynamic Spiking Estimation (MDSE) attack, which dynamically combines multiple surrogate gradients to overcome these gaps. MDSE produces adversarial examples that fool both SNN and non-SNN models, achieving up to 91.4% higher effectiveness on SNN/ViT ensembles and a 3x boost on adversarially trained SNN ensembles over Auto-PGD. Experiments span three datasets (CIFAR-10, CIFAR-100, ImageNet) and nineteen classifiers, and we will release code and models upon publication.
comment: Accepted manuscript. Published in *Neurocomputing*, Volume 656, 2025, Article 131506. Available online 12 September 2025. DOI: 10.1016/j.neucom.2025.131506
♻ ☆ Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts ICCV 2025
Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships including visually similar, binomial, and semantically related concepts. To address this, we introduce EraseBench, a comprehensive benchmark for evaluating post-erasure performance. EraseBench includes over 100 curated concepts, targeted evaluation prompts, and a robust set of metrics to assess both effectiveness and side effects of erasure. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.
comment: Accepted for publication at ICCV 2025
♻ ☆ Efficient Flow Matching using Latent Variables
Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2d Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.
Machine Learning 260
☆ TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
☆ Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, and reinforcement learning (RL) has become a key paradigm for training them. However, the trajectories of search agents are structurally heterogeneous, where variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions. Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross-stratum bias-an "apples-to-oranges" comparison of heterogeneous trajectories. This cross-stratum bias distorts credit assignment and hinders exploration of complex, multi-step search strategies. To address this, we propose Stratified GRPO, whose central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on their structural properties and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers. Our analysis proves that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates inside each stratum, and retains the global unbiasedness and unit-variance properties enjoyed by standard normalization, resulting in a more pure and scale-stable learning signal. To improve practical stability under finite-sample regimes, we further linearly blend SAN with the global estimator. Extensive experiments on diverse single-hop and multi-hop question-answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.
☆ Training Dynamics Impact Post-Training Quantization Robustness
While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
☆ Modulation Discovery with Differentiable Digital Signal Processing
Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that allow users to modulate the output with ease. However, determining the modulation signals used to create a sound is difficult, and existing sound-matching / parameter estimation systems are often uninterpretable black boxes or predict high-dimensional framewise parameter values without considering the shape, structure, and routing of the underlying modulation curves. We propose a neural sound-matching approach that leverages modulation extraction, constrained control signal parameterizations, and differentiable digital signal processing (DDSP) to discover the modulations present in a sound. We demonstrate the effectiveness of our approach on highly modulated synthetic and real audio samples, its applicability to different DDSP synth architectures, and investigate the trade-off it incurs between interpretability and sound-matching accuracy. We make our code and audio samples available and provide the trained DDSP synths in a VST plugin.
comment: Accepted to WASPAA 2025 (best paper award candidate). Code, audio samples, and plugins can be found at https://christhetree.github.io/mod_discovery/
☆ Reference Grounded Skill Discovery
Scaling unsupervised skill discovery algorithms to high-DoF agents remains challenging. As dimensionality increases, the exploration space grows exponentially, while the manifold of meaningful skills remains limited. Therefore, semantic meaningfulness becomes essential to effectively guide exploration in high-dimensional spaces. In this work, we present Reference-Grounded Skill Discovery (RGSD), a novel algorithm that grounds skill discovery in a semantically meaningful latent space using reference data. RGSD first performs contrastive pretraining to embed motions on a unit hypersphere, clustering each reference trajectory into a distinct direction. This grounding enables skill discovery to simultaneously involve both imitation of reference behaviors and the discovery of semantically related diverse behaviors. On a simulated SMPL humanoid with 359-D observations and 69-D actions, RGSD learns structured skills including walking, running, punching, and side stepping, and also discovers related novel behaviors. In downstream control tasks, RGSD outperforms imitation-based skill acquisition baselines. Our results suggest that lightweight reference-guided grounding offers a practical path to discovering semantically rich and structured skills in high-DoF systems.
☆ Latent Speech-Text Transformer
Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
comment: 16 pages, 13 figures
☆ On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond
This paper formally studies generation processes, including auto-regressive next-token prediction and masked diffusion, that abstract beyond architectural specifics. At this level of abstraction, we quantify their benefits and limitations through measurable criteria such as computational hardness and learnability. In particular, we demonstrate that allowing generation to proceed beyond autoregression and current masked diffusion, with capabilities to rewrite and length-variable edit, can bring significant theoretical and empirical advantages, with important implications for frontier LLMs that aspire to tackle increasingly hard problems and work universally across domains beyond natural language, such as coding and science.
☆ BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
Real-time speech assistants are becoming increasingly popular for ensuring improved accessibility to information. Bengali, being a low-resource language with a high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model in ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers.
☆ Conformalized Gaussian processes for online uncertainty quantification over graphs
Uncertainty quantification (UQ) over graphs arises in a number of safety-critical applications in network science. The Gaussian process (GP), as a classical Bayesian framework for UQ, has been developed to handle graph-structured data by devising topology-aware kernel functions. However, such GP-based approaches are limited not only by the prohibitive computational complexity, but also the strict modeling assumptions that might yield poor coverage, especially with labels arriving on the fly. To effect scalability, we devise a novel graph-aware parametric GP model by leveraging the random feature (RF)-based kernel approximation, which is amenable to efficient recursive Bayesian model updates. To further allow for adaptivity, an ensemble of graph-aware RF-based scalable GPs have been leveraged, with per-GP weight adapted to data arriving incrementally. To ensure valid coverage with robustness to model mis-specification, we wed the GP-based set predictors with the online conformal prediction framework, which post-processes the prediction sets using adaptive thresholds. Experimental results the proposed method yields improved coverage and efficient prediction sets over existing baselines by adaptively ensembling the GP models and setting the key threshold parameters in CP.
☆ Climate Model Tuning with Online Synchronization-Based Parameter Estimation
In climate science, the tuning of climate models is a computationally intensive problem due to the combination of the high-dimensionality of the system state and long integration times. Here we demonstrate the potential of a parameter estimation algorithm which makes use of synchronization to tune a global atmospheric model at modest computational costs. We first use it to directly optimize internal model parameters. We then apply the algorithm to the weights of each member of a supermodel ensemble to optimize the overall predictions. In both cases, the algorithm is able to find parameters which result in reduced errors in the climatology of the model. Finally, we introduce a novel approach which combines both methods called adaptive supermodeling, where the internal parameters of the members of a supermodel are tuned simultaneously with the model weights such that the supermodel predictions are optimized. For a case designed to challenge the two previous methods, adaptive supermodeling achieves a performance similar to a perfect model.
comment: 19 pages, 11 figures
☆ Differentiable Model Predictive Control on the GPU
Differentiable model predictive control (MPC) offers a powerful framework for combining learning and control. However, its adoption has been limited by the inherently sequential nature of traditional optimization algorithms, which are challenging to parallelize on modern computing hardware like GPUs. In this work, we tackle this bottleneck by introducing a GPU-accelerated differentiable optimization tool for MPC. This solver leverages sequential quadratic programming and a custom preconditioned conjugate gradient (PCG) routine with tridiagonal preconditioning to exploit the problem's structure and enable efficient parallelization. We demonstrate substantial speedups over CPU- and GPU-based baselines, significantly improving upon state-of-the-art training times on benchmark reinforcement learning and imitation learning tasks. Finally, we showcase the method on the challenging task of reinforcement learning for driving at the limits of handling, where it enables robust drifting of a Toyota Supra through water puddles.
☆ Thermodynamic Performance Limits for Score-Based Diffusion Models
We establish a fundamental connection between score-based diffusion models and non-equilibrium thermodynamics by deriving performance limits based on entropy rates. Our main theoretical contribution is a lower bound on the negative log-likelihood of the data that relates model performance to entropy rates of diffusion processes. We numerically validate this bound on a synthetic dataset and investigate its tightness. By building a bridge to entropy rates - system, intrinsic, and exchange entropy - we provide new insights into the thermodynamic operation of these models, drawing parallels to Maxwell's demon and implications for thermodynamic computing hardware. Our framework connects generative modeling performance to fundamental physical principles through stochastic thermodynamics.
☆ Higher-Order Feature Attribution: Bridging Statistics, Explainable AI, and Topological Signal Processing
Feature attributions are post-training analysis methods that assess how various input features of a machine learning model contribute to an output prediction. Their interpretation is straightforward when features act independently, but becomes less direct when the predictive model involves interactions such as multiplicative relationships or joint feature contributions. In this work, we propose a general theory of higher-order feature attribution, which we develop on the foundation of Integrated Gradients (IG). This work extends existing frameworks in the literature on explainable AI. When using IG as the method of feature attribution, we discover natural connections to statistics and topological signal processing. We provide several theoretical results that establish the theory, and we validate our theory on a few examples.
comment: 5 pages, 3 figures
☆ TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts
Revealing novel insights from the relationship between molecular measurements and pathology remains a very impactful application of machine learning in biomedicine. Data in this domain typically contain only a few observations but thousands of potentially noisy features, posing challenges for conventional machine learning approaches. While prior-data fitted networks emerge as foundation models for tabular data, they are currently not suited to handle large feature counts (>500). Although feature reduction enables their application, it hinders feature importance analysis. We propose a strategy that extends existing models through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, matches or exceeds its base model's performance while exhibiting improved robustness to noise. It seamlessly scales beyond 50,000 features, regardless of noise levels, while maintaining inherent interpretability, which is critical for biomedical applications. Our results show that prior-informed adaptation is suitable to enhance the capability of foundation models for high-dimensional data. On real-world biomedical datasets many of the most relevant features identified by the model overlap with previous biological findings, while others propose potential starting points for future studies.
☆ LLMs as Policy-Agnostic Teammates: A Case Study in Human Proxy Design for Heterogeneous Agent Teams
A critical challenge in modelling Heterogeneous-Agent Teams is training agents to collaborate with teammates whose policies are inaccessible or non-stationary, such as humans. Traditional approaches rely on expensive human-in-the-loop data, which limits scalability. We propose using Large Language Models (LLMs) as policy-agnostic human proxies to generate synthetic data that mimics human decision-making. To evaluate this, we conduct three experiments in a grid-world capture game inspired by Stag Hunt, a game theory paradigm that balances risk and reward. In Experiment 1, we compare decisions from 30 human participants and 2 expert judges with outputs from LLaMA 3.1 and Mixtral 8x22B models. LLMs, prompted with game-state observations and reward structures, align more closely with experts than participants, demonstrating consistency in applying underlying decision criteria. Experiment 2 modifies prompts to induce risk-sensitive strategies (e.g. "be risk averse"). LLM outputs mirror human participants' variability, shifting between risk-averse and risk-seeking behaviours. Finally, Experiment 3 tests LLMs in a dynamic grid-world where the LLM agents generate movement actions. LLMs produce trajectories resembling human participants' paths. While LLMs cannot yet fully replicate human adaptability, their prompt-guided diversity offers a scalable foundation for simulating policy-agnostic teammates.
comment: This is a preprint of a paper presented at the \textit{European Conference on Artificial Intelligence (ECAI 2025)}. It is made publicly available for the benefit of the research community and should be regarded as a preprint rather than a formally reviewed publication
Implicit Updates for Average-Reward Temporal Difference Learning
Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($\lambda$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($\lambda$), which employs an implicit fixed point update to provide data-adaptive stabilization while preserving the per iteration computational complexity of standard average-reward TD($\lambda$). In contrast to prior finite-time analyses of average-reward TD($\lambda$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($\lambda$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($\lambda$).
☆ Non-iid hypothesis testing: from classical to quantum
We study hypothesis testing (aka state certification) in the non-identically distributed setting. A recent work (Garg et al. 2023) considered the classical case, in which one is given (independent) samples from $T$ unknown probability distributions $p_1, \dots, p_T$ on $[d] = \{1, 2, \dots, d\}$, and one wishes to accept/reject the hypothesis that their average $p_{\mathrm{avg}}$ equals a known hypothesis distribution $q$. Garg et al. showed that if one has just $c = 2$ samples from each $p_i$, and provided $T \gg \frac{\sqrt{d}}{\epsilon^2} + \frac{1}{\epsilon^4}$, one can (whp) distinguish $p_{\mathrm{avg}} = q$ from $d_{\mathrm{TV}}(p_{\mathrm{avg}},q) > \epsilon$. This nearly matches the optimal result for the classical iid setting (namely, $T \gg \frac{\sqrt{d}}{\epsilon^2}$). Besides optimally improving this result (and generalizing to tolerant testing with more stringent distance measures), we study the analogous problem of hypothesis testing for non-identical quantum states. Here we uncover an unexpected phenomenon: for any $d$-dimensional hypothesis state $\sigma$, and given just a single copy ($c = 1$) of each state $\rho_1, \dots, \rho_T$, one can distinguish $\rho_{\mathrm{avg}} = \sigma$ from $D_{\mathrm{tr}}(\rho_{\mathrm{avg}},\sigma) > \epsilon$ provided $T \gg d/\epsilon^2$. (Again, we generalize to tolerant testing with more stringent distance measures.) This matches the optimal result for the iid case, which is surprising because doing this with $c = 1$ is provably impossible in the classical case. We also show that the analogous phenomenon happens for the non-iid extension of identity testing between unknown states. A technical tool we introduce may be of independent interest: an Efron-Stein inequality, and more generally an Efron-Stein decomposition, in the quantum setting.
comment: 33 pages, 2 figures
☆ Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images
We tackle the problem of forecasting bimanual 3D hand motion & articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality in hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and effectiveness of our lifting (42% better) & forecasting (16.4% gain) models, over the best baselines, especially in zero-shot generalization to everyday images.
comment: Project page: https://ap229997.github.io/projects/forehand4d
☆ Improved High-probability Convergence Guarantees of Decentralized SGD
Convergence in high-probability (HP) has been receiving increasing interest, due to its attractive properties, such as exponentially decaying tail bounds and strong guarantees for each individual run of an algorithm. While HP guarantees are extensively studied in centralized settings, much less is understood in the decentralized, networked setup. Existing HP studies in decentralized settings impose strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise, resulting in a significant gap between assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, even for vanilla Decentralized Stochastic Gradient Descent ($\mathtt{DSGD}$) algorithm. This is contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed to guarantee MSE convergence. Motivated by this observation, we revisit HP guarantees for $\mathtt{DSGD}$ in the presence of light-tailed noise. We show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing uniformly bounded gradients and other restrictive assumptions, while simultaneously achieving order-optimal rates for both non-convex and strongly convex costs. Moreover, our improved analysis yields linear speed-up in the number of users, demonstrating that $\mathtt{DSGD}$ maintains strong performance in the HP sense and matches existing MSE guarantees. Our improved results stem from a careful analysis of the MGF of quantities of interest (norm-squared of gradient or optimality gap) and the MGF of the consensus gap between users' models. To achieve linear speed-up, we provide a novel result on the variance-reduction effect of decentralized methods in the HP sense and more fine-grained bounds on the MGF for strongly convex costs, which are both of independent interest.
comment: 39 pages
☆ Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks
Multi-task reinforcement learning often relies on task metadata -- such as brief natural-language descriptions -- to guide behavior across diverse objectives. We present Lexical Policy Networks (LEXPOL), a language-conditioned mixture-of-policies architecture for multi-task RL. LEXPOL encodes task metadata with a text encoder and uses a learned gating module to select or blend among multiple sub-policies, enabling end-to-end training across tasks. On MetaWorld benchmarks, LEXPOL matches or exceeds strong multi-task baselines in success rate and sample efficiency, without task-specific retraining. To analyze the mechanism, we further study settings with fixed expert policies obtained independently of the gate and show that the learned language gate composes these experts to produce behaviors appropriate to novel task descriptions and unseen task combinations. These results indicate that natural-language metadata can effectively index and recombine reusable skills within a single policy.
comment: 14 pages, 3 figures, 12 tables, 2 appendices. Currently under review
☆ lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models
Large Language Models (LLMs) are increasingly integrated into everyday applications, but their prevalent cloud-based deployment raises growing concerns around data privacy and long-term sustainability. Running LLMs locally on mobile and edge devices (on-device LLMs) offers the promise of enhanced privacy, reliability, and reduced communication costs. However, realizing this vision remains challenging due to substantial memory and compute demands, as well as limited visibility into performance-efficiency trade-offs on resource-constrained hardware. We propose lm-Meter, the first lightweight, online latency profiler tailored for on-device LLM inference. lm-Meter captures fine-grained, real-time latency at both phase (e.g., embedding, prefill, decode, softmax, sampling) and kernel levels without auxiliary devices. We implement lm-Meter on commercial mobile platforms and demonstrate its high profiling accuracy with minimal system overhead, e.g., only 2.58% throughput reduction in prefill and 0.99% in decode under the most constrained Powersave governor. Leveraging lm-Meter, we conduct comprehensive empirical studies revealing phase- and kernel-level bottlenecks in on-device LLM inference, quantifying accuracy-efficiency trade-offs, and identifying systematic optimization opportunities. lm-Meter provides unprecedented visibility into the runtime behavior of LLMs on constrained platforms, laying the foundation for informed optimization and accelerating the democratization of on-device LLM systems. Code and tutorials are available at https://github.com/amai-gsu/LM-Meter.
comment: This is the preprint version of the paper accepted to The 10th ACM/IEEE Symposium on Edge Computing (SEC 2025)
☆ Downsized and Compromised?: Assessing the Faithfulness of Model Compression
In real-world applications, computational constraints often require transforming large models into smaller, more efficient versions through model compression. While these techniques aim to reduce size and computational cost without sacrificing performance, their evaluations have traditionally focused on the trade-off between size and accuracy, overlooking the aspect of model faithfulness. This limited view is insufficient for high-stakes domains like healthcare, finance, and criminal justice, where compressed models must remain faithful to the behavior of their original counterparts. This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups, thereby exposing shifts that aggregate fairness metrics may obscure. We demonstrate our approaches by applying quantization and pruning to artificial neural networks (ANNs) trained on three diverse and socially meaningful datasets. Our findings show that high accuracy does not guarantee faithfulness, and our statistical tests detect subtle yet significant shifts that are missed by standard metrics, such as Accuracy and Equalized Odds. The proposed metrics provide a practical and more direct method for ensuring that efficiency gains through compression do not compromise the fairness or faithfulness essential for trustworthy AI.
comment: Submitted to and under review at Springer Machine Learning Journal
☆ PolyGraph Discrepancy: a classifier-based metric for graph generation
Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics based on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely kernel and descriptor parametrization, making them incomparable across different graph descriptors. We introduce PolyGraph Discrepancy (PGD), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon distance of graph distributions by fitting binary classifiers to distinguish between real and generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers approximates a variational lower bound on the JS distance between the two distributions. Resulting metrics are constrained to the unit interval [0,1] and are comparable across different graph descriptors. We further derive a theoretically grounded summary metric that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGD provides a more robust and insightful evaluation compared to MMD metrics. The PolyGraph framework for benchmarking graph generative models is made publicly available at https://github.com/BorgwardtLab/polygraph-benchmark.
☆ Influence Functions for Efficient Data Selection in Reasoning
Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on indirect heuristics such as problem difficulty or trace length, while instruction-tuning has explored a broader range of automated selection strategies, but rarely in the context of reasoning. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy, and introduce influence-based pruning, which consistently outperforms perplexity and embedding-based baselines on math reasoning within a model family.
☆ The Physics of Data and Tasks: Theories of Locality and Compositionality in Deep Learning
Deep neural networks have achieved remarkable success, yet our understanding of how they learn remains limited. These models can learn high-dimensional tasks, which is generally statistically intractable due to the curse of dimensionality. This apparent paradox suggests that learnable data must have an underlying latent structure. What is the nature of this structure? How do neural networks encode and exploit it, and how does it quantitatively impact performance - for instance, how does generalization improve with the number of training examples? This thesis addresses these questions by studying the roles of locality and compositionality in data, tasks, and deep learning representations.
comment: PhD dissertation. Preprint
☆ Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences
Large language models (LLMs) are increasingly shaping how information is created and disseminated, from companies using them to craft persuasive advertisements, to election campaigns optimizing messaging to gain votes, to social media influencers boosting engagement. These settings are inherently competitive, with sellers, candidates, and influencers vying for audience approval, yet it remains poorly understood how competitive feedback loops influence LLM behavior. We show that optimizing LLMs for competitive success can inadvertently drive misalignment. Using simulated environments across these scenarios, we find that, 6.3% increase in sales is accompanied by a 14.0% rise in deceptive marketing; in elections, a 4.9% gain in vote share coincides with 22.3% more disinformation and 12.5% more populist rhetoric; and on social media, a 7.5% engagement boost comes with 188.6% more disinformation and a 16.3% increase in promotion of harmful behaviors. We call this phenomenon Moloch's Bargain for AI--competitive success achieved at the cost of alignment. These misaligned behaviors emerge even when models are explicitly instructed to remain truthful and grounded, revealing the fragility of current alignment safeguards. Our findings highlight how market-driven optimization pressures can systematically erode alignment, creating a race to the bottom, and suggest that safe deployment of AI systems will require stronger governance and carefully designed incentives to prevent competitive dynamics from undermining societal trust.
☆ The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
comment: Preprint
☆ Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences, yet the underlying reward signals they internalize remain hidden, posing a critical challenge for interpretability and safety. Existing approaches attempt to extract these latent incentives using Inverse Reinforcement Learning (IRL), but treat all preference pairs equally, often overlooking the most informative signals: those examples the extracted reward model misclassifies or assigns nearly equal scores, which we term \emph{failures}. We introduce a novel \emph{failure-aware} IRL algorithm that focuses on misclassified or difficult examples to recover the latent rewards defining model behaviors. By learning from these failures, our failure-aware IRL extracts reward functions that better reflect the true objectives behind RLHF. We demonstrate that failure-aware IRL outperforms existing IRL baselines across multiple metrics when applied to LLM detoxification, without requiring external classifiers or supervision. Crucially, failure-aware IRL yields rewards that better capture the true incentives learned during RLHF, enabling more effective re-RLHF training than standard IRL. This establishes failure-aware IRL as a robust, scalable method for auditing model alignment and reducing ambiguity in the IRL process.
comment: Preprint
☆ Learning Mixtures of Linear Dynamical Systems (MoLDS) via Hybrid Tensor-EM Method
Mixtures of linear dynamical systems (MoLDS) provide a path to model time-series data that exhibit diverse temporal dynamics across trajectories. However, its application remains challenging in complex and noisy settings, limiting its effectiveness for neural data analysis. Tensor-based moment methods can provide global identifiability guarantees for MoLDS, but their performance degrades under noise and complexity. Commonly used expectation-maximization (EM) methods offer flexibility in fitting latent models but are highly sensitive to initialization and prone to poor local minima. Here, we propose a tensor-based method that provides identifiability guarantees for learning MoLDS, which is followed by EM updates to combine the strengths of both approaches. The novelty in our approach lies in the construction of moment tensors using the input-output data to recover globally consistent estimates of mixture weights and system parameters. These estimates can then be refined through a Kalman EM algorithm, with closed-form updates for all LDS parameters. We validate our framework on synthetic benchmarks and real-world datasets. On synthetic data, the proposed Tensor-EM method achieves more reliable recovery and improved robustness compared to either pure tensor or randomly initialized EM methods. We then analyze neural recordings from the primate somatosensory cortex while a non-human primate performs reaches in different directions. Our method successfully models and clusters different conditions as separate subsystems, consistent with supervised single-LDS fits for each condition. Finally, we apply this approach to another neural dataset where monkeys perform a sequential reaching task. These results demonstrate that MoLDS provides an effective framework for modeling complex neural data, and that Tensor-EM is a reliable approach to MoLDS learning for these applications.
comment: 20 pages, 7 figures
☆ EmoHRNet: High-Resolution Neural Network Based Speech Emotion Recognition
Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the initial to the final layers. By transforming audio samples into spectrograms, EmoHRNet leverages the HRNet architecture to extract high-level features. EmoHRNet's unique architecture maintains high-resolution representations throughout, capturing both granular and overarching emotional cues from speech signals. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO. Thus, we show that EmoHRNet sets a new benchmark in the SER domain.
☆ Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks IEEE VIS 2025
AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into performance. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Furthermore, the impact of chart design on performance appears to be a secondary factor, but it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or those colored randomly. Supplementary materials are available at https://github.com/feedzai/biy-paper.
comment: 9 pages, 3 figures, short paper accepted at VISxGenAI: 1st Workshop on GenAI, Agents, and the Future of VIS (IEEE VIS 2025)
☆ Analyzing the Effect of Embedding Norms and Singular Values to Oversmoothing in Graph Neural Networks
In this paper, we study the factors that contribute to the effect of oversmoothing in deep Graph Neural Networks (GNNs). Specifically, our analysis is based on a new metric (Mean Average Squared Distance - $MASED$) to quantify the extent of oversmoothing. We derive layer-wise bounds on $MASED$, which aggregate to yield global upper and lower distance bounds. Based on this quantification of oversmoothing, we further analyze the importance of two different properties of the model; namely the norms of the generated node embeddings, along with the largest and smallest singular values of the weight matrices. Building on the insights drawn from the theoretical analysis, we show that oversmoothing increases as the number of trainable weight matrices and the number of adjacency matrices increases. We also use the derived layer-wise bounds on $MASED$ to form a proposal for decoupling the number of hops (i.e., adjacency depth) from the number of weight matrices. In particular, we introduce G-Reg, a regularization scheme that increases the bounds, and demonstrate through extensive experiments that by doing so node classification accuracy increases, achieving robustness at large depths. We further show that by reducing oversmoothing in deep networks, we can achieve better results in some tasks than using shallow ones. Specifically, we experiment with a ``cold start" scenario, i.e., when there is no feature information for the unlabeled nodes. Finally, we show empirically the trade-off between receptive field size (i.e., number of weight matrices) and performance, using the $MASED$ bounds. This is achieved by distributing adjacency hops across a small number of trainable layers, avoiding the extremes of under- or over-parameterization of the GNN.
☆ TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis
Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as weather, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets are underrepresented in public benchmarks due to proprietary restrictions. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks beyond forecasting, such as anomaly detection, root-cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit scale information and supports a suite of downstream tasks, including anomaly detection, root-cause analysis, and a question-answering benchmark requiring multi-modal reasoning. Benchmarking state-of-the-art time series, language, and reasoning models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics of observability data. Our experiments also underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical observability applications.
☆ Medical Vision Language Models as Policies for Robotic Surgery
Vision-based Proximal Policy Optimization (PPO) struggles with visual observation-based robotic laparoscopic surgical tasks due to the high-dimensional nature of visual input, the sparsity of rewards in surgical environments, and the difficulty of extracting task-relevant features from raw visual data. We introduce a simple approach integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. Our method is evaluated on five diverse laparoscopic surgery task environments in LapGym, using only endoscopic visual observations. MedFlamingo PPO outperforms and converges faster compared to both standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates exceeding 70% across all environments, with improvements ranging from 66.67% to 1114.29% compared to baseline. By processing task observations and instructions once per episode to generate high-level planning tokens, our method efficiently combines medical expertise with real-time visual feedback. Our results highlight the value of specialized medical knowledge in robotic surgical planning and decision-making.
comment: IEEE CAI 2025
☆ Edit-Based Flow Matching for Temporal Point Processes
Temporal point processes (TPPs) are a fundamental tool for modeling event sequences in continuous time, but most existing approaches rely on autoregressive parameterizations that are limited by their sequential sampling. Recent non-autoregressive, diffusion-style models mitigate these issues by jointly interpolating between noise and data through event insertions and deletions in a discrete Markov chain. In this work, we generalize this perspective and introduce an Edit Flow process for TPPs that transports noise to data via insert, delete, and substitute edit operations. By learning the instantaneous edit rates within a continuous-time Markov chain framework, we attain a flexible and efficient model that effectively reduces the total number of necessary edit operations during generation. Empirical results demonstrate the generative flexibility of our unconditionally trained model in a wide range of unconditional and conditional generation tasks on benchmark TPPs.
☆ BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining
Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (\textbf{B}ileve\textbf{L} \textbf{I}nfluence \textbf{S}coring method for data \textbf{S}election): a lightweight data selection method that operates entirely \emph{from scratch}, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.
☆ From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning
Autonomous driving with reinforcement learning (RL) has significant potential. However, applying RL in real-world settings remains challenging due to the need for safe, efficient, and robust learning. Incorporating human expertise into the learning process can help overcome these challenges by reducing risky exploration and improving sample efficiency. In this work, we propose a reward-free, active human-in-the-loop learning method called Human-Guided Distributional Soft Actor-Critic (H-DSAC). Our method combines Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC) to enable efficient and safe training in real-world environments. The key innovation is the construction of a distributed proxy value function within the DSAC framework. This function encodes human intent by assigning higher expected returns to expert demonstrations and penalizing actions that require human intervention. By extrapolating these labels to unlabeled states, the policy is effectively guided toward expert-like behavior. With a well-designed state space, our method achieves real-world driving policy learning within practical training times. Results from both simulation and real-world experiments demonstrate that our framework enables safe, robust, and sample-efficient learning for autonomous driving.
☆ Adaptive Pruning for Increased Robustness and Reduced Computational Overhead in Gaussian Process Accelerated Saddle Point Searches
Gaussian process (GP) regression provides a strategy for accelerating saddle point searches on high-dimensional energy surfaces by reducing the number of times the energy and its derivatives with respect to atomic coordinates need to be evaluated. The computational overhead in the hyperparameter optimization can, however, be large and make the approach inefficient. Failures can also occur if the search ventures too far into regions that are not represented well enough by the GP model. Here, these challenges are resolved by using geometry-aware optimal transport measures and an active pruning strategy using a summation over Wasserstein-1 distances for each atom-type in farthest-point sampling, selecting a fixed-size subset of geometrically diverse configurations to avoid rapidly increasing cost of GP updates as more observations are made. Stability is enhanced by permutation-invariant metric that provides a reliable trust radius for early-stopping and a logarithmic barrier penalty for the growth of the signal variance. These physically motivated algorithmic changes prove their efficacy by reducing to less than a half the mean computational time on a set of 238 challenging configurations from a previously published data set of chemical reactions. With these improvements, the GP approach is established as, a robust and scalable algorithm for accelerating saddle point searches when the evaluation of the energy and atomic forces requires significant computational effort.
comment: Invited article for the ChemPhysChem special issue dedicated to the 60th birthday of Prof. Debabrata Goswami. A preliminary version of this work was presented at the UNOOS 2025 conference
Fast Leave-One-Out Approximation from Fragment-Target Prevalence Vectors (molFTP) : From Dummy Masking to Key-LOO for Leakage-Free Feature Construction
We introduce molFTP (molecular fragment-target prevalence), a compact representation that delivers strong predictive performance. To prevent feature leakage across cross-validation folds, we implement a dummy-masking procedure that removes information about fragments present in the held-out molecules. We further show that key leave-one-out (key-loo) closely approximates true molecule-level leave-one-out (LOO), with deviation below 8% on our datasets. This enables near full data training while preserving unbiased cross-validation estimates of model performance. Overall, molFTP provides a fast, leakage-resistant fragment-target prevalence vectorization with practical safeguards (dummy masking or key-LOO) that approximate LOO at a fraction of its cost.
comment: 28 pages, 21 figures, 3 tables
☆ Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
The paper provides data-dependent bounds on the test error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The bounds are stable under approximation with Langevin Monte Carlo algorithms. Experiments on the MNIST and CIFAR-10 datasets verify that the bounds yield nontrivial predictions on true labeled data and correctly upper bound the test error for random labels. Our method indicates that generalization in the low-temperature, interpolation regime is already signaled by small training errors in the more classical high temperature regime.
☆ Emergent AI Surveillance: Overlearned Person Re-Identification and Its Mitigation in Law Enforcement Context
Generic instance search models can dramatically reduce the manual effort required to analyze vast surveillance footage during criminal investigations by retrieving specific objects of interest to law enforcement. However, our research reveals an unintended emergent capability: through overlearning, these models can single out specific individuals even when trained on datasets without human subjects. This capability raises concerns regarding identification and profiling of individuals based on their personal data, while there is currently no clear standard on how de-identification can be achieved. We evaluate two technical safeguards to curtail a model's person re-identification capacity: index exclusion and confusion loss. Our experiments demonstrate that combining these approaches can reduce person re-identification accuracy to below 2% while maintaining 82% of retrieval performance for non-person objects. However, we identify critical vulnerabilities in these mitigations, including potential circumvention using partial person images. These findings highlight urgent regulatory questions at the intersection of AI governance and data protection: How should we classify and regulate systems with emergent identification capabilities? And what technical standards should be required to prevent identification capabilities from developing in seemingly benign applications?
comment: 10 pages, accepted to AIES 2025
☆ Out-of-Distribution Detection from Small Training Sets using Bayesian Neural Network Classifiers BMVC
Out-of-Distribution (OOD) detection is critical to AI reliability and safety, yet in many practical settings, only a limited amount of training data is available. Bayesian Neural Networks (BNNs) are a promising class of model on which to base OOD detection, because they explicitly represent epistemic (i.e. model) uncertainty. In the small training data regime, BNNs are especially valuable because they can incorporate prior model information. We introduce a new family of Bayesian posthoc OOD scores based on expected logit vectors, and compare 5 Bayesian and 4 deterministic posthoc OOD scores. Experiments on MNIST and CIFAR-10 In-Distributions, with 5000 training samples or less, show that the Bayesian methods outperform corresponding deterministic methods.
comment: British Machine Vision Conference (BMVC) 2025; 18 pages, 6 figures, 3 tables
☆ RamPINN: Recovering Raman Spectra From Coherent Anti-Stokes Spectra Using Embedded Physics
Transferring the recent advancements in deep learning into scientific disciplines is hindered by the lack of the required large-scale datasets for training. We argue that in these knowledge-rich domains, the established body of scientific theory provides reliable inductive biases in the form of governing physical laws. We address the ill-posed inverse problem of recovering Raman spectra from noisy Coherent Anti-Stokes Raman Scattering (CARS) measurements, as the true Raman signal here is suppressed by a dominating non-resonant background. We propose RamPINN, a model that learns to recover Raman spectra from given CARS spectra. Our core methodological contribution is a physics-informed neural network that utilizes a dual-decoder architecture to disentangle resonant and non-resonant signals. This is done by enforcing the Kramers-Kronig causality relations via a differentiable Hilbert transform loss on the resonant and a smoothness prior on the non-resonant part of the signal. Trained entirely on synthetic data, RamPINN demonstrates strong zero-shot generalization to real-world experimental data, explicitly closing this gap and significantly outperforming existing baselines. Furthermore, we show that training with these physics-based losses alone, without access to any ground-truth Raman spectra, still yields competitive results. This work highlights a broader concept: formal scientific rules can act as a potent inductive bias, enabling robust, self-supervised learning in data-limited scientific domains.
☆ Hybrid Quantum-Classical Policy Gradient for Adaptive Control of Cyber-Physical Systems: A Comparative Study of VQC vs. MLP
The comparative evaluation between classical and quantum reinforcement learning (QRL) paradigms was conducted to investigate their convergence behavior, robustness under observational noise, and computational efficiency in a benchmark control environment. The study employed a multilayer perceptron (MLP) agent as a classical baseline and a parameterized variational quantum circuit (VQC) as a quantum counterpart, both trained on the CartPole-v1 environment over 500 episodes. Empirical results demonstrated that the classical MLP achieved near-optimal policy convergence with a mean return of 498.7 +/- 3.2, maintaining stable equilibrium throughout training. In contrast, the VQC exhibited limited learning capability, with an average return of 14.6 +/- 4.8, primarily constrained by circuit depth and qubit connectivity. Noise robustness analysis further revealed that the MLP policy deteriorated gracefully under Gaussian perturbations, while the VQC displayed higher sensitivity at equivalent noise levels. Despite the lower asymptotic performance, the VQC exhibited significantly lower parameter count and marginally increased training time, highlighting its potential scalability for low-resource quantum processors. The results suggest that while classical neural policies remain dominant in current control benchmarks, quantum-enhanced architectures could offer promising efficiency advantages once hardware noise and expressivity limitations are mitigated.
comment: 6 pages, 5 figures, 2 tables, 17 equations, 1 algorithm
☆ Uncertainty in Machine Learning
This book chapter introduces the principles and practical applications of uncertainty quantification in machine learning. It explains how to identify and distinguish between different types of uncertainty and presents methods for quantifying uncertainty in predictive models, including linear regression, random forests, and neural networks. The chapter also covers conformal prediction as a framework for generating predictions with predefined confidence intervals. Finally, it explores how uncertainty estimation can be leveraged to improve business decision-making, enhance model reliability, and support risk-aware strategies.
comment: Authored by Hans Weytjens. Wouter Verbeke provided proofreading and served as the chief editor of the book in which this chapter appears
☆ Information-Theoretic Policy Pre-Training with Empowerment
Empowerment, an information-theoretic measure of an agent's potential influence on its environment, has emerged as a powerful intrinsic motivation and exploration framework for reinforcement learning (RL). Besides for unsupervised RL and skill learning algorithms, the specific use of empowerment as a pre-training signal has received limited attention in the literature. We show that empowerment can be used as a pre-training signal for data-efficient downstream task adaptation. For this we extend the traditional notion of empowerment by introducing discounted empowerment, which balances the agent's control over the environment across short- and long-term horizons. Leveraging this formulation, we propose a novel pre-training paradigm that initializes policies to maximize discounted empowerment, enabling agents to acquire a robust understanding of environmental dynamics. We analyze empowerment-based pre-training for various existing RL algorithms and empirically demonstrate its potential as a general-purpose initialization strategy: empowerment-maximizing policies with long horizons are data-efficient and effective, leading to improved adaptability in downstream tasks. Our findings pave the way for future research to scale this framework to high-dimensional and complex tasks, further advancing the field of RL.
☆ Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires two competing objectives: to inject enough stochasticity to explore multiple reasoning chains, and to ensure sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post-generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by correctness, not confidence alone. We should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Calibrated-TopK and Calibrated-epsilon set truncation threshold based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty and show gains across math and general reasoning benchmarks.
Diffusion Models for Low-Light Image Enhancement: A Multi-Perspective Taxonomy and Performance Analysis
Low-light image enhancement (LLIE) is vital for safety-critical applications such as surveillance, autonomous navigation, and medical imaging, where visibility degradation can impair downstream task performance. Recently, diffusion models have emerged as a promising generative paradigm for LLIE due to their capacity to model complex image distributions via iterative denoising. This survey provides an up-to-date critical analysis of diffusion models for LLIE, distinctively featuring an in-depth comparative performance evaluation against Generative Adversarial Network and Transformer-based state-of-the-art methods, a thorough examination of practical deployment challenges, and a forward-looking perspective on the role of emerging paradigms like foundation models. We propose a multi-perspective taxonomy encompassing six categories: Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal, and Autonomous; that map enhancement methods across physical priors, conditioning schemes, and computational efficiency. Our taxonomy is grounded in a hybrid view of both the model mechanism and the conditioning signals. We evaluate qualitative failure modes, benchmark inconsistencies, and trade-offs between interpretability, generalization, and inference efficiency. We also discuss real-world deployment constraints (e.g., memory, energy use) and ethical considerations. This survey aims to guide the next generation of diffusion-based LLIE research by highlighting trends and surfacing open research questions, including novel conditioning, real-time adaptation, and the potential of foundation models.
Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample $x$ efficiently and in closed-form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as {\bf JEPA-SCORE}.
☆ N-Parties Private Structure and Parameter Learning for Sum-Product Networks
A sum-product network (SPN) is a graphical model that allows several types of probabilistic inference to be performed efficiently. In this paper, we propose a privacy-preserving protocol which tackles structure generation and parameter learning of SPNs. Additionally, we provide a protocol for private inference on SPNs, subsequent to training. To preserve the privacy of the participants, we derive our protocol based on secret sharing, which guarantees privacy in the honest-but-curious setting even when at most half of the parties cooperate to disclose the data. The protocol makes use of a forest of randomly generated SPNs, which is trained and weighted privately and can then be used for private inference on data points. Our experiments indicate that preserving the privacy of all participants does not decrease log-likelihood performance on both homogeneously and heterogeneously partitioned data. We furthermore show that our protocol's performance is comparable to current state-of-the-art SPN learners in homogeneously partitioned data settings. In terms of runtime and memory usage, we demonstrate that our implementation scales well when increasing the number of parties, comparing favorably to protocols for neural networks, when they are trained to reproduce the input-output behavior of SPNs.
☆ EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models
Reinforcement learning (RL) has become a pivotal component of large language model (LLM) post-training, and agentic RL extends this paradigm to operate as agents through multi-turn interaction and tool use. Scaling such systems exposes two practical bottlenecks: (1) context length grows rapidly during training, inflating memory usage and latency, and triggering out-of-memory (OOM) failures; and (2) intermediate tensors accumulate with context length, making cross-device data movement a major system bottleneck. We present EARL, a scalable system for efficient agentic RL. EARL designs a parallelism selector that dynamically adapts model and training parallelism across RL stages based on sequence length and system load, and a data dispatcher that performs layout-aware, decentralized exchange of intermediate data batches. Together, these components increase throughput, reduce long-context failures, and enable stable large-scale training of agentic LLMs without relying on hard limits or penalties of context length.
☆ LLM-FS-Agent: A Deliberative Role-based Large Language Model Architecture for Transparent Feature Selection
High-dimensional data remains a pervasive challenge in machine learning, often undermining model interpretability and computational efficiency. While Large Language Models (LLMs) have shown promise for dimensionality reduction through feature selection, existing LLM-based approaches frequently lack structured reasoning and transparent justification for their decisions. This paper introduces LLM-FS-Agent, a novel multi-agent architecture designed for interpretable and robust feature selection. The system orchestrates a deliberative "debate" among multiple LLM agents, each assigned a specific role, enabling collective evaluation of feature relevance and generation of detailed justifications. We evaluate LLM-FS-Agent in the cybersecurity domain using the CIC-DIAD 2024 IoT intrusion detection dataset and compare its performance against strong baselines, including LLM-Select and traditional methods such as PCA. Experimental results demonstrate that LLM-FS-Agent consistently achieves superior or comparable classification performance while reducing downstream training time by an average of 46% (statistically significant improvement, p = 0.028 for XGBoost). These findings highlight that the proposed deliberative architecture enhances both decision transparency and computational efficiency, establishing LLM-FS-Agent as a practical and reliable solution for real-world applications.
☆ Carré du champ flow matching: better quality-generalisation tradeoff in generative models
Deep generative models often face a fundamental tradeoff: high sample quality can come at the cost of memorisation, where the model reproduces training data rather than generalising across the underlying data geometry. We introduce Carr\'e du champ flow matching (CDC-FM), a generalisation of flow matching (FM), that improves the quality-generalisation tradeoff by regularising the probability path with a geometry-aware noise. Our method replaces the homogeneous, isotropic noise in FM with a spatially varying, anisotropic Gaussian noise whose covariance captures the local geometry of the latent data manifold. We prove that this geometric noise can be optimally estimated from the data and is scalable to large data. Further, we provide an extensive experimental evaluation on diverse datasets (synthetic manifolds, point clouds, single-cell genomics, animal motion capture, and images) as well as various neural network architectures (MLPs, CNNs, and transformers). We demonstrate that CDC-FM consistently offers a better quality-generalisation tradeoff. We observe significant improvements over standard FM in data-scarce regimes and in highly non-uniformly sampled datasets, which are often encountered in AI for science applications. Our work provides a mathematical framework for studying the interplay between data geometry, generalisation and memorisation in generative models, as well as a robust and scalable algorithm that can be readily integrated into existing flow matching pipelines.
☆ Prompt reinforcing for long-term planning of large language models
Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.
☆ An Attention-Augmented VAE-BiLSTM Framework for Anomaly Detection in 12-Lead ECG Signals
Anomaly detection in 12-lead electrocardiograms (ECGs) is critical for identifying deviations associated with cardiovascular disease. This work presents a comparative analysis of three autoencoder-based architectures: convolutional autoencoder (CAE), variational autoencoder with bidirectional long short-term memory (VAE-BiLSTM), and VAE-BiLSTM with multi-head attention (VAE-BiLSTM-MHA), for unsupervised anomaly detection in ECGs. To the best of our knowledge, this study reports the first application of a VAE-BiLSTM-MHA architecture to ECG anomaly detection. All models are trained on normal ECG samples to reconstruct non-anomalous cardiac morphology and detect deviations indicative of disease. Using a unified preprocessing and evaluation pipeline on the public China Physiological Signal Challenge (CPSC) dataset, the attention-augmented VAE achieves the best performance, with an AUPRC of 0.81 and a recall of 0.85 on the held-out test set, outperforming the other architectures. To support clinical triage, this model is further integrated into an interactive dashboard that visualizes anomaly localization. In addition, a performance comparison with baseline models from the literature is provided.
comment: 14 pages, 11 figures
☆ Kaputt: A Large-Scale Dataset for Visual Defect Detection ICCV 2025
We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec-AD and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download under https://www.kaputt-dataset.com.
comment: Accepted to ICCV 2025
☆ Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods
Transformers' quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on SWA. Component-level diagnostics reveal this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, combining attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.
☆ Segment-Factorized Full-Song Generation on Symbolic Piano Music NeurIPS 2025
We propose the Segmented Full-Song Model (SFS) for symbolic full-song generation. The model accepts a user-provided song structure and an optional short seed segment that anchors the main idea around which the song is developed. By factorizing a song into segments and generating each one through selective attention to related segments, the model achieves higher quality and efficiency compared to prior work. To demonstrate its suitability for human-AI interaction, we further wrap SFS into a web application that enables users to iteratively co-create music on a piano roll with customizable structures and flexible ordering.
comment: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: AI for Music
☆ OBSR: Open Benchmark for Spatial Representations
GeoAI is evolving rapidly, fueled by diverse geospatial datasets like traffic patterns, environmental data, and crowdsourced OpenStreetMap (OSM) information. While sophisticated AI models are being developed, existing benchmarks are often concentrated on single tasks and restricted to a single modality. As such, progress in GeoAI is limited by the lack of a standardized, multi-task, modality-agnostic benchmark for their systematic evaluation. This paper introduces a novel benchmark designed to assess the performance, accuracy, and efficiency of geospatial embedders. Our benchmark is modality-agnostic and comprises 7 distinct datasets from diverse cities across three continents, ensuring generalizability and mitigating demographic biases. It allows for the evaluation of GeoAI embedders on various phenomena that exhibit underlying geographic processes. Furthermore, we establish a simple and intuitive task-oriented model baselines, providing a crucial reference point for comparing more complex solutions.
comment: ACM SIGSPATIAL 2025 Full Paper
☆ MaNGO - Adaptable Graph Network Simulators via Meta-Learning NeurIPS 2025
Accurately simulating physics is crucial across scientific domains, with applications spanning from robotics to materials science. While traditional mesh-based simulations are precise, they are often computationally expensive and require knowledge of physical parameters, such as material properties. In contrast, data-driven approaches like Graph Network Simulators (GNSs) offer faster inference but suffer from two key limitations: Firstly, they must be retrained from scratch for even minor variations in physical parameters, and secondly they require labor-intensive data collection for each new parameter setting. This is inefficient, as simulations with varying parameters often share a common underlying latent structure. In this work, we address these challenges by learning this shared structure through meta-learning, enabling fast adaptation to new physical parameters without retraining. To this end, we propose a novel architecture that generates a latent representation by encoding graph trajectories using conditional neural processes (CNPs). To mitigate error accumulation over time, we combine CNPs with a novel neural operator architecture. We validate our approach, Meta Neural Graph Operator (MaNGO), on several dynamics prediction tasks with varying material properties, demonstrating superior performance over existing GNS methods. Notably, MaNGO achieves accuracy on unseen material properties close to that of an oracle model.
comment: 19 pages including appendix. NeurIPS 2025 (preprint version)
☆ Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering
Synthetic chain-of-thought (CoT) traces are widely used to train large reasoning models (LRMs), improving generalization by providing step-level supervision. Yet most approaches require ground-truth labels to seed or filter these traces - an expensive bottleneck in domains like biology where wet-lab data are scarce. We propose a label-free alternative: uncertainty-based filtering, which uses a model's own confidence - quantified through established uncertainty metrics like self-consistency and predictive perplexity - as a substitute for external labels. We sample multiple reasoning traces and retain only low-uncertainty subsets. Applied to biological perturbation prediction, a domain where wet-lab labels are especially costly, we show that the filtered subset has higher accuracy, and that supervised fine-tuning (SFT) on uncertainty-filtered data outperforms unfiltered synthetic data, narrows the gap to ground-truth training, and surpasses strong LRM baselines. Ablations show that per-class filtering corrects for class-specific uncertainty scales and that hybrid uncertainty metrics yield higher-quality datasets. Our results suggest that model-internal confidence is a powerful signal for efficient reasoning dataset creation, enabling LRMs in domains where supervision is expensive.
☆ DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization
Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains %or conversational data that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.
comment: Accepted to the NewSumm Workshop at EMNLP 2025
☆ How to model Human Actions distribution with Event Sequence Data
This paper studies forecasting of the future distribution of events in human action sequences, a task essential in domains like retail, finance, healthcare, and recommendation systems where the precise temporal order is often less critical than the set of outcomes. We challenge the dominant autoregressive paradigm and investigate whether explicitly modeling the future distribution or order-invariant multi-token approaches outperform order-preserving methods. We analyze local order invariance and introduce a KL-based metric to quantify temporal drift. We find that a simple explicit distribution forecasting objective consistently surpasses complex implicit baselines. We further demonstrate that mode collapse of predicted categories is primarily driven by distributional imbalance. This work provides a principled framework for selecting modeling strategies and offers practical guidance for building more accurate and robust forecasting systems.
comment: 9 pages main text + 2 pages references + 6 pages appendix, 10 figures, 3 tables. Preprint version
☆ ESS-Flow: Training-free guidance of flow-based models as inference in source space
Guiding pretrained flow-based generative models for conditional generation or to produce samples with desired target properties enables solving diverse tasks without retraining on paired data. We present ESS-Flow, a gradient-free method that leverages the typically Gaussian prior of the source distribution in flow-based models to perform Bayesian inference directly in the source space using Elliptical Slice Sampling. ESS-Flow only requires forward passes through the generative model and observation process, no gradient or Jacobian computations, and is applicable even when gradients are unreliable or unavailable, such as with simulation-based observations or quantization in the generation or observation process. We demonstrate its effectiveness on designing materials with desired target properties and predicting protein structures from sparse inter-residue distance measurements.
comment: 14 pages, 12 figures. Code will be made available after publication
☆ Multimodal Trajectory Representation Learning for Travel Time Estimation
Accurate travel time estimation (TTE) plays a crucial role in intelligent transportation systems. However, it remains challenging due to heterogeneous data sources and complex traffic dynamics. Moreover, conventional approaches typically convert trajectories into fixed-length representations, neglecting the inherent variability of real-world trajectories, which often leads to information loss or feature redundancy. To address these challenges, this paper introduces the Multimodal Dynamic Trajectory Integration (MDTI) framework--a novel multimodal trajectory representation learning approach that integrates GPS sequences, grid trajectories, and road network constraints to enhance TTE accuracy. MDTI employs modality-specific encoders and a cross-modal interaction module to capture complementary spatial, temporal, and topological semantics, while a dynamic trajectory modeling mechanism adaptively regulates information density for trajectories of varying lengths. Two self-supervised pretraining objectives, named contrastive alignment and masked language modeling, further strengthen multimodal consistency and contextual understanding. Extensive experiments on three real-world datasets demonstrate that MDTI consistently outperforms state-of-the-art baselines, confirming its robustness and strong generalization abilities. The code is publicly available at: https://github.com/freshhxy/MDTI/
☆ FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
comment: Acepted at IJCNN 2025
☆ StereoSync: Spatially-Aware Stereo Audio Generation from Video
Although audio generation has been widely studied over recent years, video-aligned audio generation still remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. Moreover, StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Indeed, given an input video, our approach extracts spatial cues from depth maps and bounding boxes, using them as cross-attention conditioning in a diffusion-based audio generation model. Such an approach allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.
comment: Accepted at IJCNN 2025
☆ Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling
Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks, but it is vulnerable when guided by process reward models, which often assign overconfident scores early in the reasoning process. This causes PF to suffer from premature exploitation: it myopically commits to locally promising trajectories, prunes potentially correct hypotheses, and converges to suboptimal solutions. This failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. To address this, we analyze the problem and identify two root causes: a lack of diversity in the particle set due to overconfident resampling and consequent inability to assess the potential of a reasoning path. We introduce Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to solve these issues. The first technique, Entropic Annealing (EA), directly mitigates particle impoverishment by monitoring search diversity via entropy; when diversity drops, it intervenes by dynamically annealing the resampling distribution to preserve exploration. The second, an enhancement called Look-ahead Modulation (LaM), adds a predictive guide to evaluate a state's potential based on its successors. On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50 % relative improvement in task reward. Together, these methods improve PF's resilience by balancing the exploration of diverse solution spaces with the exploitation of high-reward regions, ultimately leading to higher-quality solutions.
☆ Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates AISTATS 2026
Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic B\'ezier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify B\'ezier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
comment: 20 pages, 4 figures, Submitted to AISTATS 2026
☆ Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding
We present the Mellum models family, open-weight code completion models designed for interactive use in JetBrains IDEs. Mellums have 4B parameters, adopt a Llama-style architecture, and are pre-trained on ~4T tokens of permissively licensed, multi-language code. Our studies show that (i) careful data curation and staged training significantly improve the model's quality, (ii) editor-critical capabilities such as context packing are necessary for high-quality suggestions, and (iii) a compact, task-focused model can meet the cost and latency constraints of interactive completion. In the paper, we describe an end-to-end industrial pipeline for producing contextualized in-editor completion: disciplined data governance, multi-stage training that includes fill-in-the-middle and project context via supervised fine-tuning, and alignment via direct preference optimization using feedback from real-world scenarios. Our quality evaluations include both large-scale offline benchmarks and online telemetry from production deployments in JetBrains IDEs. Mellums are released under the Apache-2.0 license on HuggingFace, with a public model card providing a reproducible reference for practitioners. Our experience offers a pragmatic blueprint for taking a focused, open model from a research prototype to at scale production for hundreds of thousands of users.
comment: 11 pages, 4 figures, 3 tables
☆ Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs
We generalize the concept of M\"obius inversion and Shapley values to directed acyclic multigraphs and weighted versions thereof. We further allow value functions (games) and thus their M\"obius transforms (synergy function) and Shapley values to have values in any abelian group that is a module over a ring that contains the graph weights, e.g. vector-valued functions. To achieve this and overcome the obstruction that the classical axioms (linearity, efficiency, null player, symmetry) are not strong enough to uniquely determine Shapley values in this more general setting, we analyze Shapley values from two novel points of view: 1) We introduce projection operators that allow us to interpret Shapley values as the recursive projection and re-attribution of higher-order synergies to lower-order ones; 2) we propose a strengthening of the null player axiom and a localized symmetry axiom, namely the weak elements and flat hierarchy axioms. The former allows us to remove coalitions with vanishing synergy while preserving the rest of the hierarchical structure. The latter treats player-coalition bonds uniformly in the corner case of hierarchically flat graphs. Together with linearity these axioms already imply a unique explicit formula for the Shapley values, as well as classical properties like efficiency, null player, symmetry, and novel ones like the projection property. This whole framework then specializes to finite inclusion algebras, lattices, partial orders and mereologies, and also recovers certain previously known cases as corner cases, and presents others from a new perspective. The admission of general weighted directed acyclic multigraph structured hierarchies and vector-valued functions and Shapley values opens up the possibility for new analytic tools and application areas, like machine learning, language processing, explainable artificial intelligence, and many more.
comment: 43 pages, 2 figures
☆ DP-SNP-TIHMM: Differentially Private, Time-Inhomogeneous Hidden Markov Models for Synthesizing Genome-Wide Association Datasets
Single nucleotide polymorphism (SNP) datasets are fundamental to genetic studies but pose significant privacy risks when shared. The correlation of SNPs with each other makes strong adversarial attacks such as masked-value reconstruction, kin, and membership inference attacks possible. Existing privacy-preserving approaches either apply differential privacy to statistical summaries of these datasets or offer complex methods that require post-processing and the usage of a publicly available dataset to suppress or selectively share SNPs. In this study, we introduce an innovative framework for generating synthetic SNP sequence datasets using samples derived from time-inhomogeneous hidden Markov models (TIHMMs). To preserve the privacy of the training data, we ensure that each SNP sequence contributes only a bounded influence during training, enabling strong differential privacy guarantees. Crucially, by operating on full SNP sequences and bounding their gradient contributions, our method directly addresses the privacy risks introduced by their inherent correlations. Through experiments conducted on the real-world 1000 Genomes dataset, we demonstrate the efficacy of our method using privacy budgets of $\varepsilon \in [1, 10]$ at $\delta=10^{-4}$. Notably, by allowing the transition models of the HMM to be dependent on the location in the sequence, we significantly enhance performance, enabling the synthetic datasets to closely replicate the statistical properties of non-private datasets. This framework facilitates the private sharing of genomic data while offering researchers exceptional flexibility and utility.
☆ Transcribing Rhythmic Patterns of the Guitar Track in Polyphonic Music
Whereas chord transcription has received considerable attention during the past couple of decades, far less work has been devoted to transcribing and encoding the rhythmic patterns that occur in a song. The topic is especially relevant for instruments such as the rhythm guitar, which is typically played by strumming rhythmic patterns that repeat and vary over time. However, in many cases one cannot objectively define a single "right" rhythmic pattern for a given song section. To create a dataset with well-defined ground-truth labels, we asked expert musicians to transcribe the rhythmic patterns in 410 popular songs and record cover versions where the guitar tracks followed those transcriptions. To transcribe the strums and their corresponding rhythmic patterns, we propose a three-step framework. Firstly, we perform approximate stem separation to extract the guitar part from the polyphonic mixture. Secondly, we detect individual strums within the separated guitar audio, using a pre-trained foundation model (MERT) as a backbone. Finally, we carry out a pattern-decoding process in which the transcribed sequence of guitar strums is represented by patterns drawn from an expert-curated vocabulary. We show that it is possible to transcribe the rhythmic patterns of the guitar track in polyphonic music with quite high accuracy, producing a representation that is human-readable and includes automatically detected bar lines and time signature markers. We perform ablation studies and error analysis and propose a set of evaluation metrics to assess the accuracy and readability of the predicted rhythmic pattern sequence.
comment: Accepted to WASPAA 2025
☆ Empirical Comparison of Membership Inference Attacks in Deep Transfer Learning
With the emergence of powerful large-scale foundation models, the training paradigm is increasingly shifting from from-scratch training to transfer learning. This enables high utility training with small, domain-specific datasets typical in sensitive applications.Membership inference attacks (MIAs) provide an empirical estimate of the privacy leakage by machine learning models. Yet, prior assessments of MIAs against models fine-tuned with transfer learning rely on a small subset of possible attacks. We address this by comparing performance of diverse MIAs in transfer learning settings to help practitioners identify the most efficient attacks for privacy risk evaluation. We find that attack efficacy decreases with the increase in training data for score-based MIAs. We find that there is no one MIA which captures all privacy risks in models trained with transfer learning. While the Likelihood Ratio Attack (LiRA) demonstrates superior performance across most experimental scenarios, the Inverse Hessian Attack (IHA) proves to be more effective against models fine-tuned on PatchCamelyon dataset in high data regime.
comment: 30 pages, 13 figures, published in TMLR https://openreview.net/forum?id=UligTUCgdt
☆ Uncertainty assessment in satellite-based greenhouse gas emissions estimates using emulated atmospheric transport
Monitoring greenhouse gas emissions and evaluating national inventories require efficient, scalable, and reliable inference methods. Top-down approaches, combined with recent advances in satellite observations, provide new opportunities to evaluate emissions at continental and global scales. However, transport models used in these methods remain a key source of uncertainty: they are computationally expensive to run at scale, and their uncertainty is difficult to characterise. Artificial intelligence offers a dual opportunity to accelerate transport simulations and to quantify their associated uncertainty. We present an ensemble-based pipeline for estimating atmospheric transport "footprints", greenhouse gas mole fraction measurements, and their uncertainties using a graph neural network emulator of a Lagrangian Particle Dispersion Model (LPDM). The approach is demonstrated with GOSAT (Greenhouse Gases Observing Satellite) observations for Brazil in 2016. The emulator achieved a ~1000x speed-up over the NAME LPDM, while reproducing large-scale footprint structures. Ensembles were calculated to quantify absolute and relative uncertainty, revealing spatial correlations with prediction error. The results show that ensemble spread highlights low-confidence spatial and temporal predictions for both atmospheric transport footprints and methane mole fractions. While demonstrated here for an LPDM emulator, the approach could be applied more generally to atmospheric transport models, supporting uncertainty-aware greenhouse gas inversion systems and improving the robustness of satellite-based emissions monitoring. With further development, ensemble-based emulators could also help explore systematic LPDM errors, offering a computationally efficient pathway towards a more comprehensive uncertainty budget in greenhouse gas flux estimates.
☆ Are Heterogeneous Graph Neural Networks Truly Effective? A Causal Perspective
Graph neural networks (GNNs) have achieved remarkable success in node classification. Building on this progress, heterogeneous graph neural networks (HGNNs) integrate relation types and node and edge semantics to leverage heterogeneous information. Causal analysis for HGNNs is advancing rapidly, aiming to separate genuine causal effects from spurious correlations. However, whether HGNNs are intrinsically effective remains underexamined, and most studies implicitly assume rather than establish this effectiveness. In this work, we examine HGNNs from two perspectives: model architecture and heterogeneous information. We conduct a systematic reproduction across 21 datasets and 20 baselines, complemented by comprehensive hyperparameter retuning. To further disentangle the source of performance gains, we develop a causal effect estimation framework that constructs and evaluates candidate factors under standard assumptions through factual and counterfactual analyses, with robustness validated via minimal sufficient adjustment sets, cross-method consistency checks, and sensitivity analyses. Our results lead to two conclusions. First, model architecture and complexity have no causal effect on performance. Second, heterogeneous information exerts a positive causal effect by increasing homophily and local-global distribution discrepancy, which makes node classes more distinguishable. The implementation is publicly available at https://github.com/YXNTU/CausalHGNN.
☆ Communication Enables Cooperation in LLM Agents: A Comparison with Curriculum-Based Approaches
Eliciting cooperation in multi-agent LLM systems is critical for AI alignment. We investigate two approaches: direct communication and curriculum learning. In a 4-player Stag Hunt, a one-word "cheap talk" channel increases cooperation from 0% to 48.3%, demonstrating communication as a robust coordination mechanism. In contrast, we find that curriculum learning is highly sensitive to design choices: our pedagogical curriculum through progressively complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game with Punishment. Qualitative analysis reveals that curricula emphasizing defection-equilibrium games can induce "learned pessimism" in agents. These findings suggest that for coordination problems, simple communication protocols may be more reliable than experience-based training, and that curriculum design for social dilemmas requires careful attention to the strategic lessons embedded in game sequences.
☆ ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved state-of-the-art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re-discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta-orchestrator. Our approach significantly outperforms both manually designed MASes and state-of-the-art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.
comment: 29 pages, 2 figures
☆ Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
Masked diffusion models (MDMs) have recently emerged as a novel framework for language modeling. MDMs generate sentences by iteratively denoising masked sequences, filling in [MASK] tokens step by step. Although MDMs support any-order sampling, performance is highly sensitive to the choice of which position to unmask next. Prior work typically relies on rule-based schedules (e.g., max-confidence, max-margin), which provide ad hoc improvements. In contrast, we replace these heuristics with a learned scheduler. Specifically, we cast denoising as a KL-regularized Markov decision process (MDP) with an explicit reference policy and optimize a regularized objective that admits policy improvement and convergence guarantees under standard assumptions. We prove that the optimized policy under this framework generates samples that more closely match the data distribution than heuristic schedules. Empirically, across four benchmarks, our learned policy consistently outperforms max-confidence: for example, on SUDOKU, where unmasking order is critical, it yields a 20.1% gain over random and a 11.2% gain over max-confidence.
comment: Preprint
☆ Neighborhood-Adaptive Generalized Linear Graph Embedding with Latent Pattern Mining
Graph embedding has been widely applied in areas such as network analysis, social network mining, recommendation systems, and bioinformatics. However, current graph construction methods often require the prior definition of neighborhood size, limiting the effective revelation of potential structural correlations in the data. Additionally, graph embedding methods using linear projection heavily rely on a singular pattern mining approach, resulting in relative weaknesses in adapting to different scenarios. To address these challenges, we propose a novel model, Neighborhood-Adaptive Generalized Linear Graph Embedding (NGLGE), grounded in latent pattern mining. This model introduces an adaptive graph learning method tailored to the neighborhood, effectively revealing intrinsic data correlations. Simultaneously, leveraging a reconstructed low-rank representation and imposing $\ell_{2,0}$ norm constraint on the projection matrix allows for flexible exploration of additional pattern information. Besides, an efficient iterative solving algorithm is derived for the proposed model. Comparative evaluations on datasets from diverse scenarios demonstrate the superior performance of our model compared to state-of-the-art methods.
☆ DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities
Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modal-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA leverages a new probabilistic modeling, latent diffusion, and efficient samplers, while incorporating a challenging evaluation protocol for rigorous testing. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.
☆ Stable Robot Motions on Manifolds: Learning Lyapunov-Constrained Neural Manifold ODEs
Learning stable dynamical systems from data is crucial for safe and reliable robot motion planning and control. However, extending stability guarantees to trajectories defined on Riemannian manifolds poses significant challenges due to the manifold's geometric constraints. To address this, we propose a general framework for learning stable dynamical systems on Riemannian manifolds using neural ordinary differential equations. Our method guarantees stability by projecting the neural vector field evolving on the manifold so that it strictly satisfies the Lyapunov stability criterion, ensuring stability at every system state. By leveraging a flexible neural parameterisation for both the base vector field and the Lyapunov function, our framework can accurately represent complex trajectories while respecting manifold constraints by evolving solutions directly on the manifold. We provide an efficient training strategy for applying our framework and demonstrate its utility by solving Riemannian LASA datasets on the unit quaternion (S^3) and symmetric positive-definite matrix manifolds, as well as robotic motions evolving on \mathbb{R}^3 \times S^3. We demonstrate the performance, scalability, and practical applicability of our approach through extensive simulations and by learning robot motions in a real-world experiment.
comment: 12 pages, 6 figures
☆ Primal-Dual Direct Preference Optimization for Constrained LLM Alignment
The widespread application of Large Language Models (LLMs) imposes increasing demands on safety, such as reducing harmful content and fake information, and avoiding certain forbidden tokens due to rules and laws. While there have been several recent works studying safe alignment of LLMs, these works either require the training of reward and cost models and incur high memory and computational costs, or need prior knowledge about the optimal solution. Motivated by this fact, we study the problem of constrained alignment in LLMs, i.e., maximizing the output reward while restricting the cost due to potentially unsafe content to stay below a threshold. For this problem, we propose a novel primal-dual DPO approach, which first trains a model using standard DPO on reward preference data to provide reward information, and then adopts a rearranged Lagrangian DPO objective utilizing the provided reward information to fine-tune LLMs on cost preference data. Our approach significantly reduces memory and computational costs, and does not require extra prior knowledge. Moreover, we establish rigorous theoretical guarantees on the suboptimality and constraint violation of the output policy. We also extend our approach to an online data setting by incorporating exploration bonuses, which enables our approach to explore uncovered prompt-response space, and then provide theoretical results that get rid of the dependence on preference data coverage. Experimental results on the widely-used preference dataset PKU-SafeRLHF demonstrate the effectiveness of our approach.
☆ Sparse deepfake detection promotes better disentanglement
Due to the rapid progress of speech synthesis, deepfake detection has become a major concern in the speech processing community. Because it is a critical task, systems must not only be efficient and robust, but also provide interpretable explanations. Among the different approaches for explainability, we focus on the interpretation of latent representations. In such paper, we focus on the last layer of embeddings of AASIST, a deepfake detection architecture. We use a TopK activation inspired by SAEs on this layer to obtain sparse representations which are used in the decision process. We demonstrate that sparse deepfake detection can improve detection performance, with an EER of 23.36% on ASVSpoof5 test set, with 95% of sparsity. We then show that these representations provide better disentanglement, using completeness and modularity metrics based on mutual information. Notably, some attacks are directly encoded in the latent space.
☆ Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies
A prevailing approach for learning visuomotor policies is to employ reinforcement learning to map high-dimensional visual observations directly to action commands. However, the combination of high-dimensional visual inputs and agile maneuver outputs leads to long-standing challenges, including low sample efficiency and significant sim-to-real gaps. To address these issues, we propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a novel framework designed to improve the sample efficiency and asymptotic performance of visuomotor policy learning. OMC-RL explicitly decouples the learning process into two stages: an upstream representation learning stage and a downstream policy learning stage. In the upstream stage, a masked Transformer module is trained with temporal modeling and contrastive learning to extract temporally-aware and task-relevant representations from sequential visual inputs. After training, the learned encoder is frozen and used to extract visual representations from consecutive frames, while the Transformer module is discarded. In the downstream stage, an oracle teacher policy with privileged access to global state information supervises the agent during early training to provide informative guidance and accelerate early policy learning. This guidance is gradually reduced to allow independent exploration as training progresses. Extensive experiments in simulated and real-world environments demonstrate that OMC-RL achieves superior sample efficiency and asymptotic policy performance, while also improving generalization across diverse and perceptually complex scenarios.
☆ vAttention: Verified Sparse Attention
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-k and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with upto 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
☆ QGraphLIME - Explaining Quantum Graph Neural Networks
Quantum graph neural networks offer a powerful paradigm for learning on graph-structured data, yet their explainability is complicated by measurement-induced stochasticity and the combinatorial nature of graph structure. In this paper, we introduce QuantumGraphLIME (QGraphLIME), a model-agnostic, post-hoc framework that treats model explanations as distributions over local surrogates fit on structure-preserving perturbations of a graph. By aggregating surrogate attributions together with their dispersion, QGraphLIME yields uncertainty-aware node and edge importance rankings for quantum graph models. The framework further provides a distribution-free, finite-sample guarantee on the size of the surrogate ensemble: a Dvoretzky-Kiefer-Wolfowitz bound ensures uniform approximation of the induced distribution of a binary class probability at target accuracy and confidence under standard independence assumptions. Empirical studies on controlled synthetic graphs with known ground truth demonstrate accurate and stable explanations, with ablations showing clear benefits of nonlinear surrogate modeling and highlighting sensitivity to perturbation design. Collectively, these results establish a principled, uncertainty-aware, and structure-sensitive approach to explaining quantum graph neural networks, and lay the groundwork for scaling to broader architectures and real-world datasets, as quantum resources mature. Code is available at https://github.com/smlab-niser/qglime.
☆ Verifier-free Test-Time Sampling for Vision Language Action Models
Vision-Language-Action models (VLAs) have demonstrated remarkable performance in robot control. However, they remain fundamentally limited in tasks that require high precision due to their single-inference paradigm. While test-time scaling approaches using external verifiers have shown promise, they require additional training and fail to generalize to unseen conditions. We propose Masking Distribution Guided Selection (MG-Select), a novel test-time scaling framework for VLAs that leverages the model's internal properties without requiring additional training or external modules. Our approach utilizes KL divergence from a reference action token distribution as a confidence metric for selecting the optimal action from multiple candidates. We introduce a reference distribution generated by the same VLA but with randomly masked states and language conditions as inputs, ensuring maximum uncertainty while remaining aligned with the target task distribution. Additionally, we propose a joint training strategy that enables the model to learn both conditional and unconditional distributions by applying dropout to state and language conditions, thereby further improving the quality of the reference distribution. Our experiments demonstrate that MG-Select achieves significant performance improvements, including a 28%/35% improvement in real-world in-distribution/out-of-distribution tasks, along with a 168% relative gain on RoboCasa pick-and-place tasks trained with 30 demonstrations.
comment: 14 pages; 3 figures
☆ Inductive inference of gradient-boosted decision trees on graphs for insurance fraud detection
Graph-based methods are becoming increasingly popular in machine learning due to their ability to model complex data and relations. Insurance fraud is a prime use case, since false claims are often the result of organised criminals that stage accidents or the same persons filing erroneous claims on multiple policies. One challenge is that graph-based approaches struggle to find meaningful representations of the data because of the high class imbalance present in fraud data. Another is that insurance networks are heterogeneous and dynamic, given the changing relations among people, companies and policies. That is why gradient boosted tree approaches on tabular data still dominate the field. Therefore, we present a novel inductive graph gradient boosting machine (G-GBM) for supervised learning on heterogeneous and dynamic graphs. We show that our estimator competes with popular graph neural network approaches in an experiment using a variety of simulated random graphs. We demonstrate the power of G-GBM for insurance fraud detection using an open-source and a real-world, proprietary dataset. Given that the backbone model is a gradient boosting forest, we apply established explainability methods to gain better insights into the predictions made by G-GBM.
Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models
Concept Bottleneck Models (CBNMs) are deep learning models that provide interpretability by enforcing a bottleneck layer where predictions are based exclusively on human-understandable concepts. However, this constraint also restricts information flow and often results in reduced predictive accuracy. Concept Sidechannel Models (CSMs) address this limitation by introducing a sidechannel that bypasses the bottleneck and carry additional task-relevant information. While this improves accuracy, it simultaneously compromises interpretability, as predictions may rely on uninterpretable representations transmitted through sidechannels. Currently, there exists no principled technique to control this fundamental trade-off. In this paper, we close this gap. First, we present a unified probabilistic concept sidechannel meta-model that subsumes existing CSMs as special cases. Building on this framework, we introduce the Sidechannel Independence Score (SIS), a metric that quantifies a CSM's reliance on its sidechannel by contrasting predictions made with and without sidechannel information. We propose SIS regularization, which explicitly penalizes sidechannel reliance to improve interpretability. Finally, we analyze how the expressivity of the predictor and the reliance of the sidechannel jointly shape interpretability, revealing inherent trade-offs across different CSM architectures. Empirical results show that state-of-the-art CSMs, when trained solely for accuracy, exhibit low representation interpretability, and that SIS regularization substantially improves their interpretability, intervenability, and the quality of learned interpretable task predictors. Our work provides both theoretical and practical tools for developing CSMs that balance accuracy and interpretability in a principled manner.
☆ NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO -- a hyperparameter-free fully TTA method, that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
☆ From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs
With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei NPU, Graphcore IPU, and Cerebras WSE, etc. Most of these accelerators adopt multi-core architectures to achieve enhanced scalability, but lack the flexibility of SIMT architectures. Therefore, without careful configuration of the hardware architecture, as well as deliberate design of tensor parallelism and core placement strategies, computational resources may be underutilized, resulting in suboptimal inference performance. To address these challenges, we first present a multi-level simulation framework with both transaction-level and performance-model-based simulation for multi-core NPUs. Using this simulator, we conduct a systematic analysis and further propose the optimal solutions for tensor parallelism strategies, core placement policies, memory management methods, as well as the selection between PD-disaggregation and PD-fusion on multi-core NPUs. We conduct comprehensive experiments on representative LLMs and various NPU configurations. The evaluation results demonstrate that, our solution can achieve 1.32x-6.03x speedup compared to SOTA designs for multi-core NPUs across different hardware configurations. As for LLM serving, our work offers guidance on designing optimal hardware architectures and serving strategies for multi-core NPUs across various LLM workloads.
☆ Monte Carlo-Type Neural Operator for Differential Equations
The Monte Carlo-type Neural Operator (MCNO) introduces a framework for learning solution operators of one-dimensional partial differential equations (PDEs) by directly learning the kernel function and approximating the associated integral operator using a Monte Carlo-type approach. Unlike Fourier Neural Operators (FNOs), which rely on spectral representations and assume translation-invariant kernels, MCNO makes no such assumptions. The kernel is represented as a learnable tensor over sampled input-output pairs, and sampling is performed once, uniformly at random from a discretized grid. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training, while an interpolation step maps between arbitrary input and output grids to further enhance flexibility. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with efficient computational cost. We also provide a theoretical analysis proving that the Monte Carlo estimator yields a bounded bias and variance under mild regularity assumptions. This result holds in any spatial dimension, suggesting that MCNO may extend naturally beyond one-dimensional problems. More broadly, this work explores how Monte Carlo-type integration can be incorporated into neural operator frameworks for continuous-domain PDEs, providing a theoretically supported alternative to spectral methods (such as FNO) and to graph-based Monte Carlo approaches (such as the Graph Kernel Neural Operator, GNO).
☆ InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment
Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
☆ PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias emphasizes short-range continuity but undermines the model's capacity to capture long-range dependencies, hindering its ability to enforce global structural properties such as symmetry, consistent topology, and large-scale geometric regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. In addition, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, in dense generation with 8,192 points, PointNSP's advantages become even more pronounced, underscoring its scalability potential.
☆ Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning
Fundamental limits to predictability are central to our understanding of many physical and computational systems. Here we show that, despite its remarkable capabilities, deep learning exhibits such fundamental limits rooted in the fractal, riddled geometry of its basins of attraction: any initialization that leads to one solution lies arbitrarily close to another that leads to a different one. We derive sufficient conditions for the emergence of riddled basins by analytically linking features widely observed in deep learning, including chaotic learning dynamics and symmetry-induced invariant subspaces, to reveal a general route to riddling in realistic deep networks. The resulting basins of attraction possess an infinitely fine-scale fractal structure characterized by an uncertainty exponent near zero, so that even large increases in the precision of initial conditions yield only marginal gains in outcome predictability. Riddling thus imposes a fundamental limit on the predictability and hence reproducibility of neural network training, providing a unified account of many empirical observations. These results reveal a general organizing principle of deep learning with important implications for optimization and the safe deployment of artificial intelligence.
☆ In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
comment: 45 pages, 12 figures. Project website: https://agentflow.stanford.edu/
☆ Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
The proliferation of mobile devices generates a massive volume of time series across various domains, where effective time series forecasting enables a variety of real-world applications. This study focuses on a new problem of source-free domain adaptation for time series forecasting. It aims to adapt a pretrained model from sufficient source time series to the sparse target time series domain without access to the source data, embracing data protection regulations. To achieve this, we propose TimePD, the first source-free time series forecasting framework with proxy denoising, where large language models (LLMs) are employed to benefit from their generalization capabilities. Specifically, TimePD consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets offer insight into the effectiveness of the proposed TimePD, outperforming SOTA baselines by 9.3% on average.
☆ When Does Global Attention Help? A Unified Empirical Study on Atomistic Graph Learning
Graph neural networks (GNNs) are widely used as surrogates for costly experiments and first-principles simulations to study the behavior of compounds at atomistic scale, and their architectural complexity is constantly increasing to enable the modeling of complex physics. While most recent GNNs combine more traditional message passing neural networks (MPNNs) layers to model short-range interactions with more advanced graph transformers (GTs) with global attention mechanisms to model long-range interactions, it is still unclear when global attention mechanisms provide real benefits over well-tuned MPNN layers due to inconsistent implementations, features, or hyperparameter tuning. We introduce the first unified, reproducible benchmarking framework - built on HydraGNN - that enables seamless switching among four controlled model classes: MPNN, MPNN with chemistry/topology encoders, GPS-style hybrids of MPNN with global attention, and fully fused local - global models with encoders. Using seven diverse open-source datasets for benchmarking across regression and classification tasks, we systematically isolate the contributions of message passing, global attention, and encoder-based feature augmentation. Our study shows that encoder-augmented MPNNs form a robust baseline, while fused local-global models yield the clearest benefits for properties governed by long-range interaction effects. We further quantify the accuracy - compute trade-offs of attention, reporting its overhead in memory. Together, these results establish the first controlled evaluation of global attention in atomistic graph learning and provide a reproducible testbed for future model development.
comment: 40 pages, 8 figures, 18 tables
☆ (Token-Level) \textbf{InfoRMIA}: Stronger Membership Inference and Memorization Assessment for LLMs
Machine learning models are known to leak sensitive information, as they inevitably memorize (parts of) their training data. More alarmingly, large language models (LLMs) are now trained on nearly all available data, which amplifies the magnitude of information leakage and raises serious privacy risks. Hence, it is more crucial than ever to quantify privacy risk before the release of LLMs. The standard method to quantify privacy is via membership inference attacks, where the state-of-the-art approach is the Robust Membership Inference Attack (RMIA). In this paper, we present InfoRMIA, a principled information-theoretic formulation of membership inference. Our method consistently outperforms RMIA across benchmarks while also offering improved computational efficiency. In the second part of the paper, we identify the limitations of treating sequence-level membership inference as the gold standard for measuring leakage. We propose a new perspective for studying membership and memorization in LLMs: token-level signals and analyses. We show that a simple token-based InfoRMIA can pinpoint which tokens are memorized within generated outputs, thereby localizing leakage from the sequence level down to individual tokens, while achieving stronger sequence-level inference power on LLMs. This new scope rethinks privacy in LLMs and can lead to more targeted mitigation, such as exact unlearning.
☆ Power Mechanism: Private Tabular Representation Release for Model Agnostic Consumption
Traditional collaborative learning approaches are based on sharing of model weights between clients and a server. However, there are advantages to resource efficiency through schemes based on sharing of embeddings (activations) created from the data. Several differentially private methods were developed for sharing of weights while such mechanisms do not exist so far for sharing of embeddings. We propose Ours to learn a privacy encoding network in conjunction with a small utility generation network such that the final embeddings generated from it are equipped with formal differential privacy guarantees. These privatized embeddings are then shared with a more powerful server, that learns a post-processing that results in a higher accuracy for machine learning tasks. We show that our co-design of collaborative and private learning results in requiring only one round of privatized communication and lesser compute on the client than traditional methods. The privatized embeddings that we share from the client are agnostic to the type of model (deep learning, random forests or XGBoost) used on the server in order to process these activations to complete a task.
☆ On the Theory of Continual Learning with Gradient Descent for Neural Networks
Continual learning, the ability of a model to adapt to an ongoing sequence of tasks without forgetting the earlier ones, is a central goal of artificial intelligence. To shed light on its underlying mechanisms, we analyze the limitations of continual learning in a tractable yet representative setting. In particular, we study one-hidden-layer quadratic neural networks trained by gradient descent on an XOR cluster dataset with Gaussian noise, where different tasks correspond to different clusters with orthogonal means. Our results obtain bounds on the rate of forgetting during train and test-time in terms of the number of iterations, the sample size, the number of tasks, and the hidden-layer size. Our results reveal interesting phenomena on the role of different problem parameters in the rate of forgetting. Numerical experiments across diverse setups confirm our results, demonstrating their validity beyond the analyzed settings.
Efficient Learning-based Graph Simulation for Temporal Graphs ICDE 2025
Graph simulation has recently received a surge of attention in graph processing and analytics. In real-life applications, e.g. social science, biology, and chemistry, many graphs are composed of a series of evolving graphs (i.e., temporal graphs). While most of the existing graph generators focus on static graphs, the temporal information of the graphs is ignored. In this paper, we focus on simulating temporal graphs, which aim to reproduce the structural and temporal properties of the observed real-life temporal graphs. In this paper, we first give an overview of the existing temporal graph generators, including recently emerged learning-based approaches. Most of these learning-based methods suffer from one of the limitations: low efficiency in training or slow generating, especially for temporal random walk-based methods. Therefore, we propose an efficient learning-based approach to generate graph snapshots, namely temporal graph autoencoder (TGAE). Specifically, we propose an attention-based graph encoder to encode temporal and structural characteristics on sampled ego-graphs. And we proposed an ego-graph decoder that can achieve a good trade-off between simulation quality and efficiency in temporal graph generation. Finally, the experimental evaluation is conducted among our proposed TGAE and representative temporal graph generators on real-life temporal graphs and synthesized graphs. It is reported that our proposed approach outperforms the state-of-the-art temporal graph generators by means of simulation quality and efficiency.
comment: 14 pages, 6 figures, IEEE ICDE 2025
☆ Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processes
Methods for solving scientific computing and inference problems, such as kernel- and neural network-based approaches for partial differential equations (PDEs), inverse problems, and supervised learning tasks, depend crucially on the choice of hyperparameters. Specifically, the efficacy of such methods, and in particular their accuracy, stability, and generalization properties, strongly depends on the choice of hyperparameters. While bilevel optimization offers a principled framework for hyperparameter tuning, its nested optimization structure can be computationally demanding, especially in PDE-constrained contexts. In this paper, we propose an efficient strategy for hyperparameter optimization within the bilevel framework by employing a Gauss-Newton linearization of the inner optimization step. Our approach provides closed-form updates, eliminating the need for repeated costly PDE solves. As a result, each iteration of the outer loop reduces to a single linearized PDE solve, followed by explicit gradient-based hyperparameter updates. We demonstrate the effectiveness of the proposed method through Gaussian process models applied to nonlinear PDEs and to PDE inverse problems. Extensive numerical experiments highlight substantial improvements in accuracy and robustness compared to conventional random hyperparameter initialization. In particular, experiments with additive kernels and neural network-parameterized deep kernels demonstrate the method's scalability and effectiveness for high-dimensional hyperparameter optimization.
☆ Domain-Shift-Aware Conformal Prediction for Large Language Models
Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.
comment: 26 pages
☆ Generative Dynamic Graph Representation Learning for Conspiracy Spoofing Detection
Spoofing detection in financial trading is crucial, especially for identifying complex behaviors such as conspiracy spoofing. Traditional machine-learning approaches primarily focus on isolated node features, often overlooking the broader context of interconnected nodes. Graph-based techniques, particularly Graph Neural Networks (GNNs), have advanced the field by leveraging relational information effectively. However, in real-world spoofing detection datasets, trading behaviors exhibit dynamic, irregular patterns. Existing spoofing detection methods, though effective in some scenarios, struggle to capture the complexity of dynamic and diverse, evolving inter-node relationships. To address these challenges, we propose a novel framework called the Generative Dynamic Graph Model (GDGM), which models dynamic trading behaviors and the relationships among nodes to learn representations for conspiracy spoofing detection. Specifically, our approach incorporates the generative dynamic latent space to capture the temporal patterns and evolving market conditions. Raw trading data is first converted into time-stamped sequences. Then we model trading behaviors using the neural ordinary differential equations and gated recurrent units, to generate the representation incorporating temporal dynamics of spoofing patterns. Furthermore, pseudo-label generation and heterogeneous aggregation techniques are employed to gather relevant information and enhance the detection performance for conspiratorial spoofing behaviors. Experiments conducted on spoofing detection datasets demonstrate that our approach outperforms state-of-the-art models in detection accuracy. Additionally, our spoofing detection system has been successfully deployed in one of the largest global trading markets, further validating the practical applicability and performance of the proposed method.
comment: 10 pages, 5 figures, ACM the web conference 2025
☆ Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
comment: Project page: https://agenticlearning.ai/midway-network/
☆ Critical attention scaling in long-context transformers
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
comment: 29 pages, 2 figures
☆ Channel Simulation and Distributed Compression with Ensemble Rejection Sampling
We study channel simulation and distributed matching, two fundamental problems with several applications to machine learning, using a recently introduced generalization of the standard rejection sampling (RS) algorithm known as Ensemble Rejection Sampling (ERS). For channel simulation, we propose a new coding scheme based on ERS that achieves a near-optimal coding rate. In this process, we demonstrate that standard RS can also achieve a near-optimal coding rate and generalize the result of Braverman and Garg (2014) to the continuous alphabet setting. Next, as our main contribution, we present a distributed matching lemma for ERS, which serves as the rejection sampling counterpart to the Poisson Matching Lemma (PML) introduced by Li and Anantharam (2021). Our result also generalizes a recent work on importance matching lemma (Phan et al, 2024) and, to our knowledge, is the first result on distributed matching in the family of rejection sampling schemes where the matching probability is close to PML. We demonstrate the practical significance of our approach over prior works by applying it to distributed compression. The effectiveness of our proposed scheme is validated through experiments involving synthetic Gaussian sources and distributed image compression using the MNIST dataset.
☆ Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and alternating least-squares implementation. We apply PGSVD to both LLM and VLM, showing better accuracy at the same compression levels and inference speedup.
☆ Permutation-Invariant Representation Learning for Robust and Privacy-Preserving Feature Selection
Feature selection eliminates redundancy among features to improve downstream task performance while reducing computational overhead. Existing methods often struggle to capture intricate feature interactions and adapt across diverse application scenarios. Recent advances employ generative intelligence to alleviate these drawbacks. However, these methods remain constrained by permutation sensitivity in embedding and reliance on convexity assumptions in gradient-based search. To address these limitations, our initial work introduces a novel framework that integrates permutation-invariant embedding with policy-guided search. Although effective, it still left opportunities to adapt to realistic distributed scenarios. In practice, data across local clients is highly imbalanced, heterogeneous and constrained by strict privacy regulations, limiting direct sharing. These challenges highlight the need for a framework that can integrate feature selection knowledge across clients without exposing sensitive information. In this extended journal version, we advance the framework from two perspectives: 1) developing a privacy-preserving knowledge fusion strategy to derive a unified representation space without sharing sensitive raw data. 2) incorporating a sample-aware weighting strategy to address distributional imbalance among heterogeneous local clients. Extensive experiments validate the effectiveness, robustness, and efficiency of our framework. The results further demonstrate its strong generalization ability in federated learning scenarios. The code and data are publicly available: https://anonymous.4open.science/r/FedCAPS-08BF.
☆ Teamwork: Collaborative Diffusion with Low-rank Coordination and Adaptation
Large pretrained diffusion models can provide strong priors beneficial for many graphics applications. However, generative applications such as neural rendering and inverse methods such as SVBRDF estimation and intrinsic image decomposition require additional input or output channels. Current solutions for channel expansion are often application specific and these solutions can be difficult to adapt to different diffusion models or new tasks. This paper introduces Teamwork: a flexible and efficient unified solution for jointly increasing the number of input and output channels as well as adapting a pretrained diffusion model to new tasks. Teamwork achieves channel expansion without altering the pretrained diffusion model architecture by coordinating and adapting multiple instances of the base diffusion model (\ie, teammates). We employ a novel variation of Low Rank-Adaptation (LoRA) to jointly address both adaptation and coordination between the different teammates. Furthermore Teamwork supports dynamic (de)activation of teammates. We demonstrate the flexibility and efficiency of Teamwork on a variety of generative and inverse graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis.
Efficient learning of bosonic Gaussian unitaries
Bosonic Gaussian unitaries are fundamental building blocks of central continuous-variable quantum technologies such as quantum-optic interferometry and bosonic error-correction schemes. In this work, we present the first time-efficient algorithm for learning bosonic Gaussian unitaries with a rigorous analysis. Our algorithm produces an estimate of the unknown unitary that is accurate to small worst-case error, measured by the physically motivated energy-constrained diamond distance. Its runtime and query complexity scale polynomially with the number of modes, the inverse target accuracy, and natural energy parameters quantifying the allowed input energy and the unitary's output-energy growth. The protocol uses only experimentally friendly photonic resources: coherent and squeezed probes, passive linear optics, and heterodyne/homodyne detection. We then employ an efficient classical post-processing routine that leverages a symplectic regularization step to project matrix estimates onto the symplectic group. In the limit of unbounded input energy, our procedure attains arbitrarily high precision using only $2m+2$ queries, where $m$ is the number of modes. To our knowledge, this is the first provably efficient learning algorithm for a multiparameter family of continuous-variable unitaries.
☆ LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability
Test-time adaptation (TTA) aims to adapt a pretrained model to distribution shifts using only unlabeled test data. While promising, existing methods like Tent suffer from instability and can catastrophically forget the source knowledge, especially with small batch sizes or challenging corruptions. We argue that this arises from overly deterministic updates on a complex loss surface. In this paper, we introduce Langevin-Anchored Test-Time Adaptation (LATTA), a novel approach that regularizes adaptation through two key mechanisms: (1) a noisy weight perturbation inspired by Stochastic Gradient Langevin Dynamics (SGLD) to explore the local parameter space and escape poor local minima, and (2) a stable weight anchor that prevents the model from diverging from its robust source pre-training. This combination allows LATTA to adapt effectively without sacrificing stability. Unlike prior Bayesian TTA methods, LATTA requires no architectural changes or expensive Monte Carlo passes. We conduct extensive experiments on standard benchmarks, including Rotated-MNIST and the more challenging CIFAR-10-C. Our results demonstrate that LATTA significantly outperforms existing methods, including Tent, CoTTA, and EATA, setting a new state of the art for self-supervised TTA by improving average accuracy on CIFAR-10-C by over 2% while simultaneously reducing performance variance.
comment: MIT URTC 2025 Technical Paper (Oral), 5 pages, 3 figures
☆ H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
Autoregressive decoding in large language models (LLMs) requires caching a growing list of past key-value (KV) pairs, making long-context inference a memory-bound problem. While recent methods have explored quantizing the cache, evicting tokens, or using binary sketches for keys (e.g., Loki), these approaches often provide an incomplete solution by leaving one component (like values) uncompressed or by discarding context information. This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. H1B-KV represents each key vector using a 1-bit binary sketch, enabling hardware-friendly bitwise attention, and further compresses value vectors using 4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches full-precision performance not only on perplexity benchmarks but also on complex downstream tasks like mathematical reasoning (GSM8K), multi-task understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (SparseLLM), and key-only sketching (Loki) methods in quality-per-byte, establishing it as a robust solution for deploying LLMs in memory-constrained environments.
comment: MIT URTC 2025 Technical Paper (Oral), 5 pages, 1 figure
☆ ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR: (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy
☆ Transfer Learning on Edge Connecting Probability Estimation under Graphon Model
Graphon models provide a flexible nonparametric framework for estimating latent connectivity probabilities in networks, enabling a range of downstream applications such as link prediction and data augmentation. However, accurate graphon estimation typically requires a large graph, whereas in practice, one often only observes a small-sized network. One approach to addressing this issue is to adopt a transfer learning framework, which aims to improve estimation in a small target graph by leveraging structural information from a larger, related source graph. In this paper, we propose a novel method, namely GTRANS, a transfer learning framework that integrates neighborhood smoothing and Gromov-Wasserstein optimal transport to align and transfer structural patterns between graphs. To prevent negative transfer, GTRANS includes an adaptive debiasing mechanism that identifies and corrects for target-specific deviations via residual smoothing. We provide theoretical guarantees on the stability of the estimated alignment matrix and demonstrate the effectiveness of GTRANS in improving the accuracy of target graph estimation through extensive synthetic and real data experiments. These improvements translate directly to enhanced performance in downstream applications, such as the graph classification task and the link prediction task.
☆ Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are important techniques to align large language models (LLM) with human preference. However, the quality of RLHF and DPO training is seriously compromised by \textit{\textbf{C}orrupted} preference, reward \textit{\textbf{O}veroptimization}, and bias towards \textit{\textbf{V}erbosity}. To our knowledge, most existing works tackle only one of these important issues, and the few other works require much computation to estimate multiple reward models and lack theoretical guarantee of generalization ability. In this work, we propose RLHF-\textbf{COV} and DPO-\textbf{COV} algorithms that can simultaneously mitigate these three issues, in both offline and online settings. This ability is theoretically demonstrated by obtaining length-regularized generalization error rates for our DPO-COV algorithms trained on corrupted data, which match the best-known rates for simpler cases with clean data and without length regularization. Moreover, our DPO-COV algorithm is simple to implement without reward estimation, and is proved to be equivalent to our RLHF-COV algorithm, which directly implies the equivalence between the vanilla RLHF and DPO algorithms. Experiments demonstrate the effectiveness of our DPO-COV algorithms under both offline and online settings.
☆ NeST-BO: Fast Local Bayesian Optimization via Newton-Step Targeting of Gradient and Hessian Information
Bayesian optimization (BO) is effective for expensive black-box problems but remains challenging in high dimensions. We propose NeST-BO, a local BO method that targets the Newton step by jointly learning gradient and Hessian information with Gaussian process surrogates, and selecting evaluations via a one-step lookahead bound on Newton-step error. We show that this bound (and hence the step error) contracts with batch size, so NeST-BO directly inherits inexact-Newton convergence: global progress under mild stability assumptions and quadratic local rates once steps are sufficiently accurate. To scale, we optimize the acquisition in low-dimensional subspaces (e.g., random embeddings or learned sparse subspaces), reducing the dominant cost of learning curvature from $O(d^2)$ to $O(m^2)$ with $m \ll d$ while preserving step targeting. Across high-dimensional synthetic and real-world problems, including cases with thousands of variables and unknown active subspaces, NeST-BO consistently yields faster convergence and lower regret than state-of-the-art local and high-dimensional BO baselines.
☆ EEG-Based Acute Pain Classification: Machine Learning Model Comparison and Real-Time Clinical Feasibility
Current pain assessment within hospitals often relies on self-reporting or non-specific EKG vital signs. This system leaves critically ill, sedated, and cognitively impaired patients vulnerable to undertreated pain and opioid overuse. Electroencephalography (EEG) offers a noninvasive method of measuring brain activity. This technology could potentially be applied as an assistive tool to highlight nociceptive processing in order to mitigate this issue. In this study, we compared machine learning models for classifying high-pain versus low/no-pain EEG epochs using data from fifty-two healthy adults exposed to laser-evoked pain at three intensities (low, medium, high). Each four-second epoch was transformed into a 537-feature vector spanning spectral power, band ratios, Hjorth parameters, entropy measures, coherence, wavelet energies, and peak-frequency metrics. Nine traditional machine learning models were evaluated with leave-one-participant-out cross-validation. A support vector machine with radial basis function kernel achieved the best offline performance with 88.9% accuracy and sub-millisecond inference time (1.02 ms). Our Feature importance analysis was consistent with current canonical pain physiology, showing contralateral alpha suppression, midline theta/alpha enhancement, and frontal gamma bursts. The real-time XGBoost model maintained an end-to-end latency of about 4 ms and 94.2% accuracy, demonstrating that an EEG-based pain monitor is technically feasible within a clinical setting and provides a pathway towards clinical validation.
☆ Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting
Large Language Models (LLMs) with Mixture of Experts (MoE) architectures achieve remarkable performance improvements, but their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit serving systems. To forecast the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across three state-of-the-art large-scale MoE models (200B- 671B) using over 24,000 requests spanning diverse workloads. With the resulting 150GB+ trace files, we perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Taking wafer-scale GPUs as a case study, we demonstrate that minor architectural modifications leveraging our insights achieve substantial performance gains, delivering 6.3X and 4.0X average speedups on DeepSeek V3 and Qwen3, respectively. Our work provides the first comprehensive data-centric analysis of MoE models at scale. Our profiling traces and analysis results are publicly available at {https://huggingface.co/datasets/core12345/MoE_expert_selection_trace. We will also release our simulation framework shortly to facilitate future research in this area.
☆ Fundamental Limits of Crystalline Equivariant Graph Neural Networks: A Circuit Complexity Perspective
Graph neural networks (GNNs) have become a core paradigm for learning on relational data. In materials science, equivariant GNNs (EGNNs) have emerged as a compelling backbone for crystalline-structure prediction, owing to their ability to respect Euclidean symmetries and periodic boundary conditions. Despite strong empirical performance, their expressive power in periodic, symmetry-constrained settings remains poorly understood. This work characterizes the intrinsic computational and expressive limits of EGNNs for crystalline-structure prediction through a circuit-complexity lens. We analyze the computations carried out by EGNN layers acting on node features, atomic coordinates, and lattice matrices, and prove that, under polynomial precision, embedding width $d=O(n)$ for $n$ nodes, $O(1)$ layers, and $O(1)$-depth, $O(n)$-width MLP instantiations of the message/update/readout maps, these models admit a simulation by a uniform $\mathsf{TC}^0$ threshold-circuit family of polynomial size (with an explicit constant-depth bound). Situating EGNNs within $\mathsf{TC}^0$ provides a concrete ceiling on the decision and prediction problems solvable by such architectures under realistic resource constraints and clarifies which architectural modifications (e.g., increased depth, richer geometric primitives, or wider layers) are required to transcend this regime. The analysis complements Weisfeiler-Lehman style results that do not directly transfer to periodic crystals, and offers a complexity-theoretic foundation for symmetry-aware graph learning on crystalline systems.
☆ High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training
The development of machine learning for cardiac care is severely hampered by privacy restrictions on sharing real patient electrocardiogram (ECG) data. Although generative AI offers a promising solution, the real-world use of existing model-synthesized ECGs is limited by persistent gaps in trustworthiness and clinical utility. In this work, we address two major shortcomings of current generative ECG methods: insufficient morphological fidelity and the inability to generate personalized, patient-specific physiological signals. To address these gaps, we build on a conditional diffusion-based Structured State Space Model (SSSD-ECG) with two principled innovations: (1) MIDT-ECG (Mel-Spectrogram Informed Diffusion Training), a novel training paradigm with time-frequency domain supervision to enforce physiological structural realism, and (2) multi-modal demographic conditioning to enable patient-specific synthesis. We comprehensively evaluate our approach on the PTB-XL dataset, assessing the synthesized ECG signals on fidelity, clinical coherence, privacy preservation, and downstream task utility. MIDT-ECG achieves substantial gains: it improves morphological coherence, preserves strong privacy guarantees with all metrics evaluated exceeding the baseline by 4-8%, and notably reduces the interlead correlation error by an average of 74%, while demographic conditioning enhances signal-to-noise ratio and personalization. In critical low-data regimes, a classifier trained on datasets supplemented with our synthetic ECGs achieves performance comparable to a classifier trained solely on real data. Together, we demonstrate that ECG synthesizers, trained with the proposed time-frequency structural regularization scheme, can serve as personalized, high-fidelity, privacy-preserving surrogates when real data are scarce, advancing the responsible use of generative AI in healthcare.
☆ NorMuon: Making Muon more efficient and scalable
The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon's emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and 11.31% improvement over Muon on 1.1 B pretraining setting, while maintaining a comparable memory footprint to Muon. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.
☆ The Method of Infinite Descent
Training - the optimisation of complex models - is traditionally performed through small, local, iterative updates [D. E. Rumelhart, G. E. Hinton, R. J. Williams, Nature 323, 533-536 (1986)]. Approximating solutions through truncated gradients is a paradigm dating back to Cauchy [A.-L. Cauchy, Comptes Rendus Math\'ematique 25, 536-538 (1847)] and Newton [I. Newton, The Method of Fluxions and Infinite Series (Henry Woodfall, London, 1736)]. This work introduces the Method of Infinite Descent, a semi-analytic optimisation paradigm that reformulates training as the direct solution to the first-order optimality condition. By analytical resummation of its Taylor expansion, this method yields an exact, algebraic equation for the update step. Realisation of the infinite Taylor tower's cascading resummation is formally derived, and an exploitative algorithm for the direct solve step is proposed. This principle is demonstrated with the herein-introduced AION (Analytic, Infinitely-Optimisable Network) architecture. AION is a model designed expressly to satisfy the algebraic closure required by Infinite Descent. In a simple test problem, AION reaches the optimum in a single descent step. Together, this optimiser-model pair exemplify how analytic structure enables exact, non-iterative convergence. Infinite Descent extends beyond this example, applying to any appropriately closed architecture. This suggests a new class of semi-analytically optimisable models: the \emph{Infinity Class}; sufficient conditions for class membership are discussed. This offers a pathway toward non-iterative learning.
☆ TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using \texttt{torch.unique}, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and exceeding 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.
comment: 9 pages, 3 figures
☆ ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics
Molecular dynamics (MD) simulations underpin modern computational drug dis- covery, materials science, and biochemistry. Recent machine learning models provide high-fidelity MD predictions without the need to repeatedly solve quantum mechanical forces, enabling significant speedups over conventional pipelines. Yet many such methods typically enforce strict equivariance and rely on sequential rollouts, thus limiting their flexibility and simulation efficiency. They are also com- monly single-task, trained on individual molecules and fixed timeframes, which restricts generalization to unseen compounds and extended timesteps. To address these issues, we propose Atomistic Transformer Operator for Molecules (ATOM), a pretrained transformer neural operator for multitask molecular dynamics. ATOM adopts a quasi-equivariant design that requires no explicit molecular graph and employs a temporal attention mechanism, allowing for the accurate parallel decod- ing of multiple future states. To support operator pretraining across chemicals and timescales, we curate TG80, a large, diverse, and numerically stable MD dataset with over 2.5 million femtoseconds of trajectories across 80 compounds. ATOM achieves state-of-the-art performance on established single-task benchmarks, such as MD17, RMD17 and MD22. After multitask pretraining on TG80, ATOM shows exceptional zero-shot generalization to unseen molecules across varying time hori- zons. We believe ATOM represents a significant step toward accurate, efficient, and transferable molecular dynamics models
☆ AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning
Large Language Models (LLMs) are scaling rapidly, creating significant challenges for collaborative server client distributed training, particularly in terms of communication efficiency and computational overheads. To address these challenges, we implement Parameter-efficient Split Learning, which effectively balances efficiency and performance for collaborative training on low-resource devices. To reduce communication overhead in collaborative training, we introduce Adaptive Mixed bit Activation Quantization (AMAQ), a strategy that progressively compresses activations and gradients from high precision (6 to 8 bits) to low precision (3 to 4 bits). AMAQ achieves this by effectively allocating bit budgets across channels based on feature wise and layer wise importance using bit regularization. Under the same bit budgets, AMAQ outperforms fixed-precision approaches, delivering about 2.5% higher generation accuracy and about 1.3% better classification accuracy for models like LLaMA3 8B and Qwen2.5 7B. In addition, it significantly enhances training stability and reducing ultra-low bit representation collapse during the training. Experiments demonstrate that AMAQ integrates effectively into practical multi-machine collaborative training setups, offering superior inference accuracy with only a modest communication overhead for bits adaptation during training. This trade off makes AMAQ a practical and effective solution for collaborative training with minimal communication cost.
comment: 14 pages
☆ Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security NeurIPS 2025
Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
comment: Accepted at Lock-LLM Workshop, NeurIPS 2025
☆ Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy
Online matching problems arise in many complex systems, from cloud services and online marketplaces to organ exchange networks, where timely, principled decisions are critical for maintaining high system performance. Traditional heuristics in these settings are simple and interpretable but typically tailored to specific operating regimes, which can lead to inefficiencies when conditions change. We propose a reinforcement learning (RL) approach that learns to orchestrate a set of such expert policies, leveraging their complementary strengths in a data-driven, adaptive manner. Building on the Adv2 framework (Jonckheere et al., 2024), our method combines expert decisions through advantage-based weight updates and extends naturally to settings where only estimated value functions are available. We establish both expectation and high-probability regret guarantees and derive a novel finite-time bias bound for temporal-difference learning, enabling reliable advantage estimation even under constant step size and non-stationary dynamics. To support scalability, we introduce a neural actor-critic architecture that generalizes across large state spaces while preserving interpretability. Simulations on stochastic matching models, including an organ exchange scenario, show that the orchestrated policy converges faster and yields higher system level efficiency than both individual experts and conventional RL baselines. Our results highlight how structured, adaptive learning can improve the modeling and management of complex resource allocation and decision-making processes.
☆ A Median Perspective on Unlabeled Data for Out-of-Distribution Detection
Out-of-distribution (OOD) detection plays a crucial role in ensuring the robustness and reliability of machine learning systems deployed in real-world applications. Recent approaches have explored the use of unlabeled data, showing potential for enhancing OOD detection capabilities. However, effectively utilizing unlabeled in-the-wild data remains challenging due to the mixed nature of both in-distribution (InD) and OOD samples. The lack of a distinct set of OOD samples complicates the task of training an optimal OOD classifier. In this work, we introduce Medix, a novel framework designed to identify potential outliers from unlabeled data using the median operation. We use the median because it provides a stable estimate of the central tendency, as an OOD detection mechanism, due to its robustness against noise and outliers. Using these identified outliers, along with labeled InD data, we train a robust OOD classifier. From a theoretical perspective, we derive error bounds that demonstrate Medix achieves a low error rate. Empirical results further substantiate our claims, as Medix outperforms existing methods across the board in open-world settings, confirming the validity of our theoretical insights.
☆ ATLO-ML: Adaptive Time-Length Optimizer for Machine Learning -- Insights from Air Quality Forecasting
Accurate time-series predictions in machine learning are heavily influenced by the selection of appropriate input time length and sampling rate. This paper introduces ATLO-ML, an adaptive time-length optimization system that automatically determines the optimal input time length and sampling rate based on user-defined output time length. The system provides a flexible approach to time-series data pre-processing, dynamically adjusting these parameters to enhance predictive performance. ATLO-ML is validated using air quality datasets, including both GAMS-dataset and proprietary data collected from a data center, both in time series format. Results demonstrate that utilizing the optimized time length and sampling rate significantly improves the accuracy of machine learning models compared to fixed time lengths. ATLO-ML shows potential for generalization across various time-sensitive applications, offering a robust solution for optimizing temporal input parameters in machine learning workflows.
GUIDE: Guided Initialization and Distillation of Embeddings
Algorithmic efficiency techniques such as distillation (\cite{hinton2015distillation}) are useful in improving model quality without increasing serving costs, provided a larger teacher model is available for a smaller student model to learn from during training. Standard distillation methods are limited to only forcing the student to match the teacher's outputs. Given the costs associated with training a large model, we believe we should be extracting more useful information from a teacher model than by just making the student match the teacher's outputs. In this paper, we introduce \guide (Guided Initialization and Distillation of Embeddings). \guide can be considered a distillation technique that forces the student to match the teacher in the parameter space. Using \guide we show 25-26\% reduction in the teacher-student quality gap when using large student models (400M - 1B parameters) trained on $\approx$ 20B tokens. We also present a thorough analysis demonstrating that \guide can be combined with knowledge distillation with near additive improvements. Furthermore, we show that applying \guide alone leads to substantially better model quality than applying knowledge distillation by itself. Most importantly, \guide introduces no training or inference overhead and hence any model quality gains from our method are virtually free.
☆ Valid Stopping for LLM Generation via Empirical Dynamic Formal Lift
We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), applying anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift -- the log-likelihood ratio between full models and deliberately weakened "skeleton" baselines -- using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation by 22-28% vs. sequential baselines while maintaining delta-level control with 12% computational overhead. We introduce automated skeletons (distilled submodels, randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries + verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness -- 10.9% of stopped sequences remain incorrect even with the gate (13.2-22.7% without it). EDFL serves as a first-stage filter reducing verification burden by 83%, not as a standalone solution for safety-critical domains.
Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin
Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.
☆ How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation
Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
☆ Context-Aware Inference via Performance Forecasting in Decentralized Learning Networks
In decentralized learning networks, predictions from many participants are combined to generate a network inference. While many studies have demonstrated performance benefits of combining multiple model predictions, existing strategies using linear pooling methods (ranging from simple averaging to dynamic weight updates) face a key limitation. Dynamic prediction combinations that rely on historical performance to update weights are necessarily reactive. Due to the need to average over a reasonable number of epochs (with moving averages or exponential weighting), they tend to be slow to adjust to changing circumstances (phase or regime changes). In this work, we develop a model that uses machine learning to forecast the performance of predictions by models at each epoch in a time series. This enables `context-awareness' by assigning higher weight to models that are likely to be more accurate at a given time. We show that adding a performance forecasting worker in a decentralized learning network, following a design similar to the Allora network, can improve the accuracy of network inferences. Specifically, we find forecasting models that predict regret (performance relative to the network inference) or regret z-score (performance relative to other workers) show greater improvement than models predicting losses, which often do not outperform the naive network inference (historically weighted average of all inferences). Through a series of optimization tests, we show that the performance of the forecasting model can be sensitive to choices in the feature set and number of training epochs. These properties may depend on the exact problem and should be tailored to each domain. Although initially designed for a decentralized learning network, using performance forecasting for prediction combination may be useful in any situation where predictive rather than reactive model weighting is needed.
comment: 17 pages, 12 figures; appeared in ADI (October 2025)
☆ Road Surface Condition Detection with Machine Learning using New York State Department of Transportation Camera Images and Weather Forecast Data
The New York State Department of Transportation (NYSDOT) has a network of roadside traffic cameras that are used by both the NYSDOT and the public to observe road conditions. The NYSDOT evaluates road conditions by driving on roads and observing live cameras, tasks which are labor-intensive but necessary for making critical operational decisions during winter weather events. However, machine learning models can provide additional support for the NYSDOT by automatically classifying current road conditions across the state. In this study, convolutional neural networks and random forests are trained on camera images and weather data to predict road surface conditions. Models are trained on a hand-labeled dataset of ~22,000 camera images, each classified by human labelers into one of six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed. Model generalizability is prioritized to meet the operational needs of the NYSDOT decision makers, and the weather-related road surface condition model in this study achieves an accuracy of 81.5% on completely unseen cameras.
☆ Bayesian Optimization under Uncertainty for Training a Scale Parameter in Stochastic Models
Hyperparameter tuning is a challenging problem especially when the system itself involves uncertainty. Due to noisy function evaluations, optimization under uncertainty can be computationally expensive. In this paper, we present a novel Bayesian optimization framework tailored for hyperparameter tuning under uncertainty, with a focus on optimizing a scale- or precision-type parameter in stochastic models. The proposed method employs a statistical surrogate for the underlying random variable, enabling analytical evaluation of the expectation operator. Moreover, we derive a closed-form expression for the optimizer of the random acquisition function, which significantly reduces computational cost per iteration. Compared with a conventional one-dimensional Monte Carlo-based optimization scheme, the proposed approach requires 40 times fewer data points, resulting in up to a 40-fold reduction in computational cost. We demonstrate the effectiveness of the proposed method through two numerical examples in computational engineering.
☆ Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization
Learning from temporally-correlated data is a core facet of modern machine learning. Yet our understanding of sequential learning remains incomplete, particularly in the multi-trajectory setting where data consists of many independent realizations of a time-indexed stochastic process. This important regime both reflects modern training pipelines such as for large foundation models, and offers the potential for learning without the typical mixing assumptions made in the single-trajectory case. However, instance-optimal bounds are known only for least-squares regression with dependent covariates; for more general models or loss functions, the only broadly applicable guarantees result from a reduction to either i.i.d. learning, with effective sample size scaling only in the number of trajectories, or an existing single-trajectory result when each individual trajectory mixes, with effective sample size scaling as the full data budget deflated by the mixing-time. In this work, we significantly broaden the scope of instance-optimal rates in multi-trajectory settings via the Hellinger localization framework, a general approach for maximum likelihood estimation. Our method proceeds by first controlling the squared Hellinger distance at the path-measure level via a reduction to i.i.d. learning, followed by localization as a quadratic form in parameter space weighted by the trajectory Fisher information. This yields instance-optimal bounds that scale with the full data budget under a broad set of conditions. We instantiate our framework across four diverse case studies: a simple mixture of Markov chains, dependent linear regression under non-Gaussian noise, generalized linear models with non-monotonic activations, and linear-attention sequence models. In all cases, our bounds nearly match the instance-optimal rates from asymptotic normality, substantially improving over standard reductions.
☆ Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting
Is bigger always better for time series foundation models? With the question in mind, we explore an alternative to training a single, large monolithic model: building a portfolio of smaller, pretrained forecasting models. By applying ensembling or model selection over these portfolios, we achieve competitive performance on large-scale benchmarks using much fewer parameters. We explore strategies for designing such portfolios and find that collections of specialist models consistently outperform portfolios of independently trained generalists. Remarkably, we demonstrate that post-training a base model is a compute-effective approach for creating sufficiently diverse specialists, and provide evidences that ensembling and model selection are more compute-efficient than test-time fine-tuning.
☆ The Effect of Label Noise on the Information Content of Neural Representations
In supervised classification tasks, models are trained to predict a label for each data point. In real-world datasets, these labels are often noisy due to annotation errors. While the impact of label noise on the performance of deep learning models has been widely studied, its effects on the networks' hidden representations remain poorly understood. We address this gap by systematically comparing hidden representations using the Information Imbalance, a computationally efficient proxy of conditional mutual information. Through this analysis, we observe that the information content of the hidden representations follows a double descent as a function of the number of network parameters, akin to the behavior of the test error. We further demonstrate that in the underparameterized regime, representations learned with noisy labels are more informative than those learned with clean labels, while in the overparameterized regime, these representations are equally informative. Our results indicate that the representations of overparameterized networks are robust to label noise. We also found that the information imbalance between the penultimate and pre-softmax layers decreases with cross-entropy loss in the overparameterized regime. This offers a new perspective on understanding generalization in classification tasks. Extending our analysis to representations learned from random labels, we show that these perform worse than random features. This indicates that training on random labels drives networks much beyond lazy learning, as weights adapt to encode labels information.
comment: 10 pages, 5 figures
☆ Geometry-Aware Backdoor Attacks: Leveraging Curvature in Hyperbolic Embeddings
Non-Euclidean foundation models increasingly place representations in curved spaces such as hyperbolic geometry. We show that this geometry creates a boundary-driven asymmetry that backdoor triggers can exploit. Near the boundary, small input changes appear subtle to standard input-space detectors but produce disproportionately large shifts in the model's representation space. Our analysis formalizes this effect and also reveals a limitation for defenses: methods that act by pulling points inward along the radius can suppress such triggers, but only by sacrificing useful model sensitivity in that same direction. Building on these insights, we propose a simple geometry-adaptive trigger and evaluate it across tasks and architectures. Empirically, attack success increases toward the boundary, whereas conventional detectors weaken, mirroring the theoretical trends. Together, these results surface a geometry-specific vulnerability in non-Euclidean models and offer analysis-backed guidance for designing and understanding the limits of defenses.
☆ Making and Evaluating Calibrated Forecasts
Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting. We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify ones that do or do not preserve truthfulness. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.
☆ Monte Carlo Permutation Search
We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE algorithm. MCPS is relevant when deep reinforcement learning is not an option, or when the computing power available before play is not substantial, such as in General Game Playing, for example. The principle of MCPS is to include in the exploration term of a node the statistics on all the playouts that contain all the moves on the path from the root to the node. We extensively test MCPS on a variety of games: board games, wargame, investment game, video game and multi-player games. MCPS has better results than GRAVE in all the two-player games. It has equivalent results for multi-player games because these games are inherently balanced even when players have different strengths. We also show that using abstract codes for moves instead of exact codes can be beneficial to both MCPS and GRAVE, as they improve the permutation statistics and the AMAF statistics. We also provide a mathematical derivation of the formulas used for weighting the three sources of statistics. These formulas are an improvement on the GRAVE formula since they no longer use the bias hyperparameter of GRAVE. Moreover, MCPS is not sensitive to the ref hyperparameter.
☆ Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data
Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel \textit{Relational Attention} mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 94% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT's zero-shot transfer harnesses task-table context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.
comment: preprint; under review
☆ A General Constructive Upper Bound on Shallow Neural Nets Complexity
We provide an upper bound on the number of neurons required in a shallow neural network to approximate a continuous function on a compact set with a given accuracy. This method, inspired by a specific proof of the Stone-Weierstrass theorem, is constructive and more general than previous bounds of this character, as it applies to any continuous function on any compact set.
☆ Lagrangian neural ODEs: Measuring the existence of a Lagrangian with Helmholtz metrics NeurIPS 2025
Neural ODEs are a widely used, powerful machine learning technique in particular for physics. However, not every solution is physical in that it is an Euler-Lagrange equation. We present Helmholtz metrics to quantify this resemblance for a given ODE and demonstrate their capabilities on several fundamental systems with noise. We combine them with a second order neural ODE to form a Lagrangian neural ODE, which allows to learn Euler-Lagrange equations in a direct fashion and with zero additional inference cost. We demonstrate that, using only positional data, they can distinguish Lagrangian and non-Lagrangian systems and improve the neural ODE solutions.
comment: Accepted for the NeurIPS 2025 Machine Learning and the Physical Sciences workshop. 6 pages, 3 figures
Diffusion-Guided Renormalization of Neural Systems via Tensor Networks
Far from equilibrium, neural systems self-organize across multiple scales. Exploiting multiscale self-organization in neuroscience and artificial intelligence requires a computational framework for modeling the effective non-equilibrium dynamics of stochastic neural trajectories. Non-equilibrium thermodynamics and representational geometry offer theoretical foundations, but we need scalable data-driven techniques for modeling collective properties of high-dimensional neural networks from partial subsampled observations. Renormalization is a coarse-graining technique central to studying emergent scaling properties of many-body and nonlinear dynamical systems. While widely applied in physics and machine learning, coarse-graining complex dynamical networks remains unsolved, affecting many computational sciences. Recent diffusion-based renormalization, inspired by quantum statistical mechanics, coarse-grains networks near entropy transitions marked by maximal changes in specific heat or information transmission. Here I explore diffusion-based renormalization of neural systems by generating symmetry-breaking representations across scales and offering scalable algorithms using tensor networks. Diffusion-guided renormalization bridges microscale and mesoscale dynamics of dissipative neural systems. For microscales, I developed a scalable graph inference algorithm for discovering community structure from subsampled neural activity. Using community-based node orderings, diffusion-guided renormalization generates renormalization group flow through metagraphs and joint probability functions. Towards mesoscales, diffusion-guided renormalization targets learning the effective non-equilibrium dynamics of dissipative neural trajectories occupying lower-dimensional subspaces, enabling coarse-to-fine control in systems neuroscience and artificial intelligence.
comment: Reformatted version of Dissertation submitted for the Doctor of Philosophy in Systems and Control Engineering at Case Western Reserve University, 2025
☆ PIKAN: Physics-Inspired Kolmogorov-Arnold Networks for Explainable UAV Channel Modelling
Unmanned aerial vehicle (UAV) communications demand accurate yet interpretable air-to-ground (A2G) channel models that can adapt to nonstationary propagation environments. While deterministic models offer interpretability and deep learning (DL) models provide accuracy, both approaches suffer from either rigidity or a lack of explainability. To bridge this gap, we propose the Physics-Inspired Kolmogorov-Arnold Network (PIKAN) that embeds physical principles (e.g., free-space path loss, two-ray reflections) into the learning process. Unlike physics-informed neural networks (PINNs), PIKAN is more flexible for applying physical information because it introduces them as flexible inductive biases. Thus, it enables a more flexible training process. Experiments on UAV A2G measurement data show that PIKAN achieves comparable accuracy to DL models while providing symbolic and explainable expressions aligned with propagation laws. Remarkably, PIKAN achieves this performance with only 232 parameters, making it up to 37 times lighter than multilayer perceptron (MLP) baselines with thousands of parameters, without sacrificing correlation with measurements and also providing symbolic expressions. These results highlight PIKAN as an efficient, interpretable, and scalable solution for UAV channel modelling in beyond-5G and 6G networks.
☆ TransFIRA: Transfer Learning for Face Image Recognizability Assessment
Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder's decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary--aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment -- encoder-specific, accurate, interpretable, and extensible across modalities -- significantly advancing FIQA in accuracy, explainability, and scope.
comment: Project Page: https://transfira.github.io/
☆ Asking For It: Question-Answering for Predicting Rule Infractions in Online Content Moderation
Online communities rely on a mix of platform policies and community-authored rules to define acceptable behavior and maintain order. However, these rules vary widely across communities, evolve over time, and are enforced inconsistently, posing challenges for transparency, governance, and automation. In this paper, we model the relationship between rules and their enforcement at scale, introducing ModQ, a novel question-answering framework for rule-sensitive content moderation. Unlike prior classification or generation-based approaches, ModQ conditions on the full set of community rules at inference time and identifies which rule best applies to a given comment. We implement two model variants - extractive and multiple-choice QA - and train them on large-scale datasets from Reddit and Lemmy, the latter of which we construct from publicly available moderation logs and rule descriptions. Both models outperform state-of-the-art baselines in identifying moderation-relevant rule violations, while remaining lightweight and interpretable. Notably, ModQ models generalize effectively to unseen communities and rules, supporting low-resource moderation settings and dynamic governance environments.
comment: Accepted at ICWSM 2026
☆ Flexible Swarm Learning May Outpace Foundation Models in Essential Tasks
Foundation models have rapidly advanced AI, raising the question of whether their decisions will ultimately surpass human strategies in real-world domains. The exponential, and possibly super-exponential, pace of AI development makes such analysis elusive. Nevertheless, many application areas that matter for daily life and society show only modest gains so far; a prominent case is diagnosing and treating dynamically evolving disease in intensive care. The common challenge is adapting complex systems to dynamic environments. Effective strategies must optimize outcomes in systems composed of strongly interacting functions while avoiding shared side effects; this requires reliable, self-adaptive modeling. These tasks align with building digital twins of highly complex systems whose mechanisms are not fully or quantitatively understood. It is therefore essential to develop methods for self-adapting AI models with minimal data and limited mechanistic knowledge. As this challenge extends beyond medicine, AI should demonstrate clear superiority in these settings before assuming broader decision-making roles. We identify the curse of dimensionality as a fundamental barrier to efficient self-adaptation and argue that monolithic foundation models face conceptual limits in overcoming it. As an alternative, we propose a decentralized architecture of interacting small agent networks (SANs). We focus on agents representing the specialized substructure of the system, where each agent covers only a subset of the full system functions. Drawing on mathematical results on the learning behavior of SANs and evidence from existing applications, we argue that swarm-learning in diverse swarms can enable self-adaptive SANs to deliver superior decision-making in dynamic environments compared with monolithic foundation models, though at the cost of reduced reproducibility in detail.
☆ Conditional Denoising Diffusion Model-Based Robust MR Image Reconstruction from Highly Undersampled Data
Magnetic Resonance Imaging (MRI) is a critical tool in modern medical diagnostics, yet its prolonged acquisition time remains a critical limitation, especially in time-sensitive clinical scenarios. While undersampling strategies can accelerate image acquisition, they often result in image artifacts and degraded quality. Recent diffusion models have shown promise for reconstructing high-fidelity images from undersampled data by learning powerful image priors; however, most existing approaches either (i) rely on unsupervised score functions without paired supervision or (ii) apply data consistency only as a post-processing step. In this work, we introduce a conditional denoising diffusion framework with iterative data-consistency correction, which differs from prior methods by embedding the measurement model directly into every reverse diffusion step and training the model on paired undersampled-ground truth data. This hybrid design bridges generative flexibility with explicit enforcement of MRI physics. Experiments on the fastMRI dataset demonstrate that our framework consistently outperforms recent state-of-the-art deep learning and diffusion-based methods in SSIM, PSNR, and LPIPS, with LPIPS capturing perceptual improvements more faithfully. These results demonstrate that integrating conditional supervision with iterative consistency updates yields substantial improvements in both pixel-level fidelity and perceptual realism, establishing a principled and practical advance toward robust, accelerated MRI reconstruction.
♻ ☆ AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives
Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model's awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model's vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.
♻ ☆ A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data
In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach extends the Information Bottleneck principle to heterogeneous data through generalised product kernels, integrating continuous, nominal, and ordinal variables within a unified optimization framework. We address the following challenges: developing a systematic bandwidth selection strategy that equalises contributions across variable types, and proposing an adaptive hyperparameter updating scheme that ensures a valid solution into a predetermined number of potentially imbalanced clusters. Through simulations on 28,800 synthetic data sets and ten publicly available benchmarks, we demonstrate that the proposed method, named DIBmix, achieves superior performance compared to four established methods (KAMILA, K-Prototypes, FAMD with K-Means, and PAM with Gower's dissimilarity). Results show DIBmix particularly excels when clusters exhibit size imbalances, data contain low or moderate cluster overlap, and categorical and continuous variables are equally represented. The method presents a significant advantage over traditional centroid-based algorithms, establishing DIBmix as a competitive and theoretically grounded alternative for mixed-type data clustering.
comment: 33 pages
♻ ☆ DP-HYPE: Distributed Differentially Private Hyperparameter Search
The tuning of hyperparameters in distributed machine learning can substantially impact model performance. When the hyperparameters are tuned on sensitive data, privacy becomes an important challenge and to this end, differential privacy has emerged as the de facto standard for provable privacy. A standard setting when performing distributed learning tasks is that clients agree on a shared setup, i.e., find a compromise from a set of hyperparameters, like the learning rate of the model to be trained. Yet, prior work on differentially private hyperparameter tuning either uses computationally expensive cryptographic protocols, determines hyperparameters separately for each client, or applies differential privacy locally, which can lead to undesirable utility-privacy trade-offs. In this work, we present our algorithm DP-HYPE, which performs a distributed and privacy-preserving hyperparameter search by conducting a distributed voting based on local hyperparameter evaluations of clients. In this way, DP-HYPE selects hyperparameters that lead to a compromise supported by the majority of clients, while maintaining scalability and independence from specific learning tasks. We prove that DP-HYPE preserves the strong notion of differential privacy called client-level differential privacy and, importantly, show that its privacy guarantees do not depend on the number of hyperparameters. We also provide bounds on its utility guarantees, that is, the probability of reaching a compromise, and implement DP-HYPE as a submodule in the popular Flower framework for distributed machine learning. In addition, we evaluate performance on multiple benchmark data sets in iid as well as multiple non-iid settings and demonstrate high utility of DP-HYPE even under small privacy budgets.
♻ ☆ LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLM's autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.
♻ ☆ TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems NeurIPS 2025
Regression models are crucial in recommender systems. However, retransformation bias problem has been conspicuously neglected within the community. While many works in other fields have devised effective bias correction methods, all of them are post-hoc cures externally to the model, facing practical challenges when applied to real-world recommender systems. Hence, we propose a preemptive paradigm to eradicate the bias intrinsically from the models via minor model refinement. Specifically, a novel TranSUN method is proposed with a joint bias learning manner to offer theoretically guaranteed unbiasedness under empirical superior convergence. It is further generalized into a novel generic regression model family, termed Generalized TranSUN (GTS), which not only offers more theoretical insights but also serves as a generic framework for flexibly developing various bias-free models. Comprehensive experimental results demonstrate the superiority of our methods across data from various domains, which have been successfully deployed in two real-world industrial recommendation scenarios, i.e. product and short video recommendation scenarios in Guess What You Like business domain in the homepage of Taobao App (a leading e-commerce platform with DAU > 300M), to serve the major online traffic.
comment: 37 pages, 6 figures, NeurIPS 2025 Poster
♻ ☆ HOG-Diff: Higher-Order Guided Diffusion for Graph Generation
Graph generation is a critical yet challenging task as empirical analyses require a deep understanding of complex, non-Euclidean structures. Diffusion models have recently made significant achievements in graph generation, but these models are typically adapted from image generation frameworks and overlook inherent higher-order topology, leaving them ill-suited for capturing the topological properties of graphs. In this work, we propose Higher-order Guided Diffusion (HOG-Diff), a principled framework that progressively generates plausible graphs with inherent topological structures. HOG-Diff follows a coarse-to-fine generation curriculum guided by higher-order topology and implemented via diffusion bridges. We further prove that our model exhibits a stronger theoretical guarantee than classical diffusion frameworks. Extensive experiments on both molecular and generic graph generation tasks demonstrate that our method consistently outperforms or remains competitive with state-of-the-art baselines. Our code is available at https://github.com/Yiminghh/HOG-Diff.
♻ ☆ Hierarchical Reasoning Models: Perspectives and Misconceptions
Transformers have demonstrated remarkable performance in natural language processing and related domains, as they largely focus on sequential, autoregressive next-token prediction tasks. Yet, they struggle in logical reasoning, not necessarily because of a fundamental limitation of these models, but possibly due to the lack of exploration of more creative uses, such as latent space and recurrent reasoning. An emerging exploration in this direction is the Hierarchical Reasoning Model (Wang et. al., 2025), which introduces a novel type of recurrent reasoning in the latent space of transformers, achieving remarkable performance on a wide range of 2D reasoning tasks. Despite the promising results, this line of models is still at an early stage and calls for in-depth investigation. In this work, we review this class of models, examine key design choices, test alternative variants and clarify common misconceptions.
comment: Found errors in some results of v1. Removed them and changed conclusions
♻ ☆ A Fairness-Aware Strategy for B5G Physical-layer Security Leveraging Reconfigurable Intelligent Surfaces
Reconfigurable Intelligent Surfaces are composed of physical elements that can dynamically alter electromagnetic wave properties to enhance beamforming and lead to improvements in areas with low coverage properties. Combined with Reinforcement Learning techniques, they have the potential to be conduct as well physical-layer security hardening. Yet, and in addition to security improvements, it is crucial to consider the concept of fair communication. Reconfigurable Intelligent Surfaces must ensure that User Equipment units receive their signals with adequate strength, without other units being deprived of service due to insufficient power. In this paper, we address such a problem. We explore the fairness properties of previous work and propose a novel method that aims at obtaining both an efficient and fair duplex Reconfigurable Intelligent Surface-Reinforcement Learning system for multiple legitimate User Equipment units without reducing the level of achieved physical-layer security hardening. In terms of contributions, we uncover a fairness imbalance of a previous physical-layer security hardening solution, validate our findings and report experimental work via simulation results. We also provide an alternative reward strategy to solve the uncovered problems and release both code and datasets to foster further research in the topics of this paper.
comment: 19 pages, 5 figures, 2 tables, 41 references
♻ ☆ How Reliable are Causal Probing Interventions?
Causal probing aims to analyze foundation models by examining how intervening on their representation of various latent properties impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods, but it has been unclear how to systematically evaluate the effectiveness of these methods in practice. To address this, we define two key causal probing desiderata: completeness (how thoroughly the representation of the target property has been transformed) and selectivity (how little non-targeted properties have been impacted). We find that there is an inherent tradeoff between the two, which we define as reliability, their harmonic mean. We introduce an empirical analysis framework to measure and evaluate these quantities, allowing us to make the first direct comparisons between different families of leading causal probing methods (e.g., linear vs. nonlinear, or concept removal vs. counterfactual interventions). We find that: (1) all methods show a clear tradeoff between completeness and selectivity; (2) more complete and reliable methods have a greater impact on LLM behavior; and (3) nonlinear interventions are almost always more reliable than linear interventions.
♻ ☆ Gemstones: A Model Suite for Multi-Faceted Scaling Laws NeurIPS 2025
Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes; including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
comment: NeurIPS 2025
♻ ☆ Optimal Policy Minimum Bayesian Risk
Inference scaling helps LLMs solve complex reasoning problems through extended runtime computation. On top of long chain-of-thought (long-CoT) models, purely inference-time techniques such as best-of-N (BoN) sampling, majority voting, or more generally, minimum Bayes risk decoding (MBRD), can further improve LLM accuracy by generating multiple candidate solutions and aggregating over them. These methods typically leverage additional signals in the form of reward models and risk/similarity functions that compare generated samples, e.g., exact match in some normalized space or standard similarity metrics such as Rouge. Here we present a novel method for incorporating reward and risk/similarity signals into MBRD. Based on the concept of optimal policy in KL-controlled reinforcement learning, our framework provides a simple and well-defined mechanism for leveraging such signals, offering several advantages over traditional inference-time methods: higher robustness, improved accuracy, and well-understood asymptotic behavior. In addition, it allows for the development of a sample-efficient variant of MBRD that can adjust the number of samples to generate according to the difficulty of the problem, without relying on majority vote counts. We empirically demonstrate the advantages of our approach on math (MATH-$500$) and coding (HumanEval) tasks using recent open-source models. We also present a comprehensive analysis of its accuracy-compute trade-offs.
♻ ☆ Spatiotemporal Graph Learning with Direct Volumetric Information Passing and Feature Enhancement
Data-driven learning of physical systems has kindled significant attention, where many neural models have been developed. In particular, mesh-based graph neural networks (GNNs) have demonstrated significant potential in modeling spatiotemporal dynamics across arbitrary geometric domains. However, the existing node-edge message-passing and aggregation mechanism in GNNs limits the representation learning ability. In this paper, we proposed a dual-module framework, Cell-embedded and Feature-enhanced Graph Neural Network (aka, CeFeGNN), for learning spatiotemporal dynamics. Specifically, we embed learnable cell attributions to the common node-edge message passing process, which better captures the spatial dependency of regional features. Such a strategy essentially upgrades the local aggregation scheme from first order (e.g., from edge to node) to a higher order (e.g., from volume and edge to node), which takes advantage of volumetric information in message passing. Meanwhile, a novel feature-enhanced block is designed to further improve the model's performance and alleviate the over-smoothness problem. Extensive experiments on various PDE systems and one real-world dataset demonstrate that CeFeGNN achieves superior performance compared with other baselines.
♻ ☆ Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Reasoning language models improve performance on complex tasks by generating long chains of thought (CoTs), but this process can also increase harmful outputs in adversarial settings. In this work, we ask whether the long CoTs can be leveraged for predictive safety monitoring: do the reasoning traces provide early signals of final response alignment that could enable timely intervention? We evaluate a range of monitoring methods using either CoT text or activations, including highly capable large language models, fine-tuned classifiers, and humans. First, we find that a simple linear probe trained on CoT activations significantly outperforms all text-based baselines in predicting whether a final response is safe or unsafe, with an average absolute increase of 13 in F1 scores over the best-performing alternatives. CoT texts are often unfaithful and misleading, while model latents provide a more reliable predictive signal. Second, the probe can be applied to early CoT segments before the response is generated, showing that alignment signals appear before reasoning completes. Error analysis reveals that the performance gap between text classifiers and the linear probe largely stems from a subset of responses we call performative CoTs, where the reasoning consistently contradicts the final response as the CoT progresses. Our findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
♻ ☆ Robust-Multi-Task Gradient Boosting
Multi-task learning (MTL) has shown effectiveness in exploiting shared information across tasks to improve generalization. MTL assumes tasks share similarities that can improve performance. In addition, boosting algorithms have demonstrated exceptional performance across diverse learning problems, primarily due to their ability to focus on hard-to-learn instances and iteratively reduce residual errors. This makes them a promising approach for learning multi-task problems. However, real-world MTL scenarios often involve tasks that are not well-aligned (known as outlier or adversarial tasks), which do not share beneficial similarities with others and can, in fact, deteriorate the performance of the overall model. To overcome this challenge, we propose Robust-Multi-Task Gradient Boosting (R-MTGB), a novel boosting framework that explicitly models and adapts to task heterogeneity during training. R-MTGB structures the learning process into three sequential blocks: (1) learning shared patterns, (2) partitioning tasks into outliers and non-outliers with regularized parameters, and (3) fine-tuning task-specific predictors. This architecture enables R-MTGB to automatically detect and penalize outlier tasks while promoting effective knowledge transfer among related tasks. Our method integrates these mechanisms seamlessly within gradient boosting, allowing robust handling of noisy or adversarial tasks without sacrificing accuracy. Extensive experiments on both synthetic benchmarks and real-world datasets demonstrate that our approach successfully isolates outliers, transfers knowledge, and consistently reduces prediction errors for each task individually, and achieves overall performance gains across all tasks. These results highlight robustness, adaptability, and reliable convergence of R-MTGB in challenging MTL environments.
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
comment: 16 pages; 8 figures, 4 tables v2 changelog: Fixed the modeling for table 3, random effect is the model version
♻ ☆ CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
comment: 16 pages, 9 figures, 7 tables
♻ ☆ Unified Cross-Modal Medical Image Synthesis with Hierarchical Mixture of Product-of-Experts
We propose a deep mixture of multimodal hierarchical variational auto-encoders called MMHVAE that synthesizes missing images from observed images in different modalities. MMHVAE's design focuses on tackling four challenges: (i) creating a complex latent representation of multimodal data to generate high-resolution images; (ii) encouraging the variational distributions to estimate the missing information needed for cross-modal image synthesis; (iii) learning to fuse multimodal information in the context of missing data; (iv) leveraging dataset-level information to handle incomplete data sets at training time. Extensive experiments are performed on the challenging problem of pre-operative brain multi-parametric magnetic resonance and intra-operative ultrasound imaging.
comment: Accepted in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
♻ ☆ Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc, inside a theory-driven framework which we name the \textbf{\emph{Uni-Instruct}}. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of \textbf{\emph{1.46}} for unconditional generation and \textbf{\emph{1.38}} for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of \textbf{\emph{1.02}}, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.
♻ ☆ Fundamental Limits of Membership Inference Attacks on Machine Learning Models
Membership inference attacks (MIA) can reveal whether a particular data point was part of the training dataset, potentially exposing sensitive information about individuals. This article provides theoretical guarantees by exploring the fundamental statistical limitations associated with MIAs on machine learning models at large. More precisely, we first derive the statistical quantity that governs the effectiveness and success of such attacks. We then theoretically prove that in a non-linear regression setting with overfitting learning procedures, attacks may have a high probability of success. Finally, we investigate several situations for which we provide bounds on this quantity of interest. Interestingly, our findings indicate that discretizing the data might enhance the learning procedure's security. Specifically, it is demonstrated to be limited by a constant, which quantifies the diversity of the underlying data distribution. We illustrate those results through simple simulations.
comment: Accepted for publication in JMLR
♻ ☆ Teaching Metric Distance to Discrete Autoregressive Language Models
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
♻ ☆ BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
LLM web agents now browse and take actions on the open web, yet current agent evaluations are constrained to sandboxed environments or artificial tasks. We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks, runs Arena-style head-to-head comparisons, and uses step-level human feedback to surface failure modes. Collecting and analyzing step-level annotations on the agent traces, we identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. By constructing targeted datasets to further study these tasks, we discover variations in how different language models navigate these failure modes. We find, for example, that o4-mini deploys a wider variety of strategies to circumvent captcha resolution than other models and DeepSeek-R1 consistently misleads users about pop-up banner closure. Our findings surface both the diversity and brittleness of current web agents. More broadly, our benchmarking methodology provides an approach to evaluating and understanding web agent failure modes at scale.
♻ ☆ ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NeurIPS 2025
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
comment: Accepted at NeurIPS 2025 (oral)
♻ ☆ Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models NeurIPS 2025
Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model's expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model
comment: 10 pages, NeurIPS 2025 Spotlight
♻ ☆ SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection
Predicting earnings surprises from financial documents, such as earnings conference calls, regulatory filings, and financial news, has become increasingly important in financial economics. However, these financial documents present significant analytical challenges, typically containing over 5,000 words with substantial redundancy and industry-specific terminology that creates obstacles for language models. In this work, we propose the SAE-FiRE (Sparse Autoencoder for Financial Representation Enhancement) framework to address these limitations by extracting key information while eliminating redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to decompose dense neural representations from large language models into interpretable sparse components, then applies statistical feature selection methods, including ANOVA F-tests and tree-based importance scoring, to identify the top-k most discriminative dimensions for classification. By systematically filtering out noise that might otherwise lead to overfitting, we enable more robust and generalizable predictions. Experimental results across three financial datasets demonstrate that SAE-FiRE significantly outperforms baseline approaches.
♻ ☆ Investigating Forecasting Models for Pandemic Infections Using Heterogeneous Data Sources: A 2-year Study with COVID-19
Emerging in December 2019, the COVID-19 pandemic caused widespread health, economic, and social disruptions. Rapid global transmission overwhelmed healthcare systems, resulting in high infection rates, hospitalisations, and fatalities. To minimise the spread, governments implemented several non-pharmaceutical interventions like lockdowns and travel restrictions. While effective in controlling transmission, these measures also posed significant economic and societal challenges. Although the WHO declared COVID-19 no longer a global health emergency in May 2023, its impact persists, shaping public health strategies. The vast amount of data collected during the pandemic offers valuable insights into disease dynamics, transmission, and intervention effectiveness. Leveraging these insights can improve forecasting models, enhancing preparedness and response to future outbreaks while mitigating their social and economic impact. This paper presents a large-scale case study on COVID-19 forecasting in Cyprus, utilising a two-year dataset that integrates epidemiological data, vaccination records, policy measures, and weather conditions. We analyse infection trends, assess forecasting performance, and examine the influence of external factors on disease dynamics. The insights gained contribute to improved pandemic preparedness and response strategies.
comment: Keywords: epidemiology, pandemic forecasting, COVID-19, infections, machine learning. Published in: IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 2025
♻ ☆ Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks
In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. We consistently observe across 6 state-of-the-art text encoders and 26 downstream tasks, that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance, less than 10%, in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.
comment: Accepted to EMNLP 2025 Main Conference (Oral), camera-ready version
♻ ☆ FedFlex: Federated Learning for Diverse Netflix Recommendations
The drive for personalization in recommender systems creates a tension between user privacy and the risk of "filter bubbles". Although federated learning offers a promising paradigm for privacy-preserving recommendations, its impact on diversity remains unclear. We introduce FedFlex, a two-stage framework that combines local, on-device fine-tuning of matrix factorization models (SVD and BPR) with a lightweight Maximal Marginal Relevance (MMR) re-ranking step to promote diversity. We conducted the first live user study of a federated recommender, collecting behavioral data and feedback during a two-week online deployment. Our results show that FedFlex successfully engages users, with BPR outperforming SVD in click-through rate. Re-ranking with MMR consistently improved ranking quality (nDCG) across both models, with statistically significant gains, particularly for BPR. Diversity effects varied: MMR increased coverage for both models and improved intra-list diversity for BPR, but slightly reduced it for SVD, suggesting different interactions between personalization and diversification across models. Our exit questionnaire responses indicated that most users expressed no clear preference between re-ranked and unprocessed lists, implying that increased diversity did not substantially reduce user satisfaction.
♻ ☆ Neon: Negative Extrapolation From Self-Training Improves Image Generation
Scaling generative AI models is bottlenecked by the scarcity of high-quality training data. The ease of synthesizing from a generative model suggests using (unverified) synthetic data to augment a limited corpus of real data for the purpose of fine-tuning in the hope of improving performance. Unfortunately, however, the resulting positive feedback loop leads to model autophagy disorder (MAD, aka model collapse) that results in a rapid degradation in sample quality and/or diversity. In this paper, we introduce Neon (for Negative Extrapolation frOm self-traiNing), a new learning method that turns the degradation from self-training into a powerful signal for self-improvement. Given a base model, Neon first fine-tunes it on its own self-synthesized data but then, counterintuitively, reverses its gradient updates to extrapolate away from the degraded weights. We prove that Neon works because typical inference samplers that favor high-probability regions create a predictable anti-alignment between the synthetic and real data population gradients, which negative extrapolation corrects to better align the model with the true data distribution. Neon is remarkably easy to implement via a simple post-hoc merge that requires no new real data, works effectively with as few as 1k synthetic samples, and typically uses less than 1% additional training compute. We demonstrate Neon's universality across a range of architectures (diffusion, flow matching, autoregressive, and inductive moment matching models) and datasets (ImageNet, CIFAR-10, and FFHQ). In particular, on ImageNet 256x256, Neon elevates the xAR-L model to a new state-of-the-art FID of 1.02 with only 0.36% additional training compute. Code is available at https://github.com/VITA-Group/Neon
♻ ☆ A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)
Background: Apparent Diffusion Coefficient (ADC) values and Total Diffusion Volume (TDV) from Whole-body diffusion-weighted MRI (WB-DWI) are recognized cancer imaging biomarkers. However, manual disease delineation for ADC and TDV measurements is unfeasible in clinical practice, demanding automation. As a first step, we propose an algorithm to generate fast and reproducible probability maps of the skeleton, adjacent internal organs (liver, spleen, urinary bladder, and kidneys), and spinal canal. Methods: We developed an automated deep-learning pipeline based on a 3D patch-based Residual U-Net architecture that localises and delineates these anatomical structures on WB-DWI. The algorithm was trained using "soft labels" (non-binary segmentations) derived from a computationally intensive atlas-based approach. For training and validation, we employed a multi-centre WB-DWI dataset comprising 532 scans from patients with Advanced Prostate Cancer (APC) or Multiple Myeloma (MM), with testing on 45 patients. Results: Our weakly-supervised deep learning model achieved an average dice score of 0.67 for whole skeletal delineation, 0.76 when excluding ribcage, 0.83 for internal organs, and 0.86 for spinal canal, with average surface distances below 3mm. Relative median ADC differences between automated and manual full-body delineations were below 10%. The model was 12x faster than the atlas-based registration algorithm (25 sec vs. 5 min). Two experienced radiologists rated the model's outputs as either "good" or "excellent" on test scans, with inter-reader agreement from fair to substantial (Gwet's AC1 = 0.27-0.72). Conclusion: The model offers fast, reproducible probability maps for localising and delineating body regions on WB-DWI, potentially enabling non-invasive imaging biomarker quantification to support disease staging and treatment response assessment.
♻ ☆ CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment by value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, the feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method-Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critique by one pass only based on the correctness of the step itself, providing deterministic token-level credits to refine the tokens that were originally assigned identical rule-based rewards. To further enhance the accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones like Llama and Qwen models show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO can help the model to foster the learning of correct reasoning pathways leading to correct answers.
comment: Work in progress
♻ ☆ MetaLLMix : An XAI Aided LLM-Meta-learning Based Approach for Hyper-parameters Optimization
Effective model and hyperparameter selection remains a major challenge in deep learning, often requiring extensive expertise and computation. While AutoML and large language models (LLMs) promise automation, current LLM-based approaches rely on trial and error and expensive APIs, which provide limited interpretability and generalizability. We propose MetaLLMiX, a zero-shot hyperparameter optimization framework combining meta-learning, explainable AI, and efficient LLM reasoning. By leveraging historical experiment outcomes with SHAP explanations, MetaLLMiX recommends optimal hyperparameters and pretrained models without additional trials. We further employ an LLM-as-judge evaluation to control output format, accuracy, and completeness. Experiments on eight medical imaging datasets using nine open-source lightweight LLMs show that MetaLLMiX achieves competitive or superior performance to traditional HPO methods while drastically reducing computational cost. Our local deployment outperforms prior API-based approaches, achieving optimal results on 5 of 8 tasks, response time reductions of 99.6-99.9%, and the fastest training times on 6 datasets (2.4-15.7x faster), maintaining accuracy within 1-5% of best-performing baselines.
♻ ☆ How Foundational are Foundation Models for Time Series Forecasting? NeurIPS 2025
Foundation Models are designed to serve as versatile embedding machines, with strong zero shot capabilities and superior generalization performance when fine-tuned on diverse downstream tasks. While this is largely true for language and vision foundation models, we argue that the inherent diversity of time series data makes them less suited for building effective foundation models. We demonstrate this using forecasting as our downstream task. We show that the zero-shot capabilities of a time series foundation model are significantly influenced and tied to the specific domains it has been pretrained on. Furthermore, when applied to unseen real-world time series data, fine-tuned foundation models do not consistently yield substantially better results, relative to their increased parameter count and memory footprint, than smaller, dedicated models tailored to the specific forecasting task at hand.
comment: Typo rectified in this v3 version. Accepted at NeurIPS 2025 Workshop on Recent Advances in Time Series Foundation Models (BERT2S)
♻ ☆ Understanding Catastrophic Interference: On the Identifibility of Latent Representations
Catastrophic interference, also known as catastrophic forgetting, is a fundamental challenge in machine learning, where a trained learning model progressively loses performance on previously learned tasks when adapting to new ones. In this paper, we aim to better understand and model the catastrophic interference problem from a latent representation learning point of view, and propose a novel theoretical framework that formulates catastrophic interference as an identification problem. Our analysis demonstrates that the forgetting phenomenon can be quantified by the distance between partial-task aware (PTA) and all-task aware (ATA) setups. Building upon recent advances in identifiability theory, we prove that this distance can be minimized through identification of shared latent variables between these setups. When learning, we propose our method \ourmeos with two-stage training strategy: First, we employ maximum likelihood estimation to learn the latent representations from both PTA and ATA configurations. Subsequently, we optimize the KL divergence to identify and learn the shared latent variables. Through theoretical guarantee and empirical validations, we establish that identifying and learning these shared representations can effectively mitigate catastrophic interference in machine learning systems. Our approach provides both theoretical guarantees and practical performance improvements across both synthetic and benchmark datasets.
♻ ☆ Mitigating Exponential Mixed Frequency Growth through Frequency Selection
Quantum machine learning research has expanded rapidly due to potential computational advantages over classical methods. Angle encoding has emerged as a popular choice as feature map (FM) for embedding classical data into quantum models due to its simplicity and natural generation of truncated Fourier series, providing universal function approximation capabilities. Efficient FMs within quantum circuits can exploit exponential scaling of Fourier frequencies, with multi-dimensional inputs introducing additional exponential growth through mixed-frequency terms. Despite this promising expressive capability, practical implementation faces significant challenges. Through controlled experiments with white-box target functions, we demonstrate that training failures can occur even when all relevant frequencies are theoretically accessible. We illustrate how two primary known causes lead to unsuccessful optimization: insufficient trainable parameters relative to the model's frequency content, and limitations imposed by the ansatz's dynamic lie algebra dimension, but also uncover an additional parameter burden: the necessity of controlling non-unique frequencies within the model. To address this, we propose near-zero weight initialization to suppress unnecessary duplicate frequencies. For target functions with a priori frequency knowledge, we introduce frequency selection as a practical solution that reduces parameter requirements and mitigates the exponential growth that would otherwise render problems intractable due to parameter insufficiency. Our frequency selection approach achieved near-optimal performance (median $R^2 \approx 0.95$) with 78\% of the parameters needed by the best standard approach in 10 randomly chosen target functions.
comment: 10 pages, 3 figures
♻ ☆ Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.
comment: Killian Steunou is the main contributor and corresponding author of this work
♻ ☆ Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD
Matrix factorization mechanisms for differentially private training have emerged as a promising approach to improve model utility under privacy constraints. In practical settings, models are typically trained over multiple epochs, requiring matrix factorizations that account for repeated participation. Existing theoretical upper and lower bounds on multi-epoch factorization error leave a significant gap. In this work, we introduce a new explicit factorization method, Banded Inverse Square Root (BISR), which imposes a banded structure on the inverse correlation matrix. This factorization enables us to derive an explicit and tight characterization of the multi-epoch error. We further prove that BISR achieves asymptotically optimal error by matching the upper and lower bounds. Empirically, BISR performs on par with state-of-the-art factorization methods, while being simpler to implement, computationally efficient, and easier to analyze.
♻ ☆ Fast Policy Learning for Linear Quadratic Control with Entropy Regularization
This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems over an infinite time horizon with entropy regularization. Assuming access to the exact policy evaluation, both proposed approaches are proven to converge linearly in finding optimal policies of the regularized LQC. Moreover, the IPO method can achieve a super-linear convergence rate once it enters a local region around the optimal policy. Finally, when the optimal policy for an RL problem with a known environment is appropriately transferred as the initial policy to an RL problem with an unknown environment, the IPO method is shown to enable a super-linear convergence rate if the two environments are sufficiently close. Performances of these proposed algorithms are supported by numerical examples.
comment: 31 pages, 3 figures
♻ ☆ The Use of Binary Choice Forests to Model and Estimate Discrete Choices
Problem definition. In retailing, discrete choice models (DCMs) are commonly used to capture the choice behavior of customers when offered an assortment of products. When estimating DCMs using transaction data, flexible models (such as machine learning models or nonparametric models) are typically not interpretable and hard to estimate, while tractable models (such as the multinomial logit model) tend to misspecify the complex behavior represeted in the data. Methodology/results. In this study, we use a forest of binary decision trees to represent DCMs. This approach is based on random forests, a popular machine learning algorithm. The resulting model is interpretable: the decision trees can explain the decision-making process of customers during the purchase. We show that our approach can predict the choice probability of any DCM consistently and thus never suffers from misspecification. Moreover, our algorithm predicts assortments unseen in the training data. The mechanism and errors can be theoretically analyzed. We also prove that the random forest can recover preference rankings of customers thanks to the splitting criterion such as the Gini index and information gain ratio. Managerial implications. The framework has unique practical advantages. It can capture customers' behavioral patterns such as irrationality or sequential searches when purchasing a product. It handles nonstandard formats of training data that result from aggregation. It can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product. It can also incorporate price information and customer features. Our numerical experiments using synthetic and real data show that using random forests to estimate customer choices can outperform existing methods.
comment: 63 pages, 10 figures, 30 tables
♻ ☆ Minimizing the Weighted Number of Tardy Jobs: Data-Driven Heuristic for Single-Machine Scheduling
Existing research on single-machine scheduling is largely focused on exact algorithms, which perform well on typical instances but can significantly deteriorate on certain regions of the problem space. In contrast, data-driven approaches provide strong and scalable performance when tailored to the structure of specific datasets. Leveraging this idea, we focus on a single-machine scheduling problem where each job is defined by its weight, duration, due date, and deadline, aiming to minimize the total weight of tardy jobs. We introduce a novel data-driven scheduling heuristic that combines machine learning with problem-specific characteristics, ensuring feasible solutions, which is a common challenge for ML-based algorithms. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art in terms of optimality gap, number of optimal solutions, and adaptability across varied data scenarios, highlighting its flexibility for practical applications. In addition, we conduct a systematic exploration of ML models, addressing a common gap in similar studies by offering a detailed model selection process and providing insights into why the chosen model is the best fit.
comment: Published version: Computers & Operations Research, https://doi.org/10.1016/j.cor.2025.107281. Data are publicly available at https://doi.org/10.5281/zenodo.17233362
♻ ☆ An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
comment: preprint - under review
♻ ☆ Can foundation models actively gather information in interactive environments to test hypotheses?
Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments, a requirement for many real-world challenges. We evaluated these models on their ability to learn from experience, adapt, and gather information. First, in "Feature World," a simple setting for testing information gathering, models performed near-optimally. However, to test more complex, multi-trial learning, we implemented a text-based version of the "Alchemy" environment, a benchmark for meta-learning. Here, agents must deduce a latent causal structure by integrating information across many trials. In this setting, recent foundation models initially failed to improve their performance over time. Crucially, we found that prompting the models to summarize their observations at regular intervals enabled an emergent meta-learning process. This allowed them to improve across trials and even adaptively re-learn when the environment's rules changed unexpectedly. While most models handled the simple task, Alchemy revealed stark differences in robustness: Gemini 2.5 performed best, followed by Claude 3.7, while ChatGPT-4o and o4-mini struggled. This underscores Alchemy's value as a benchmark. Our findings demonstrate that the biggest challenge for foundation models is not selecting informative actions in the moment, but integrating knowledge through adaptive strategies over time. Encouragingly, there appears to be no intrinsic barrier to future models mastering these abilities.
♻ ☆ Expected Free Energy-based Planning as Variational Inference
We address the problem of planning under uncertainty, where an agent must choose actions that not only achieve desired outcomes but also reduce uncertainty. Traditional methods often treat exploration and exploitation as separate objectives, lacking a unified inferential foundation. Active inference, grounded in the Free Energy Principle, provides such a foundation by minimizing Expected Free Energy (EFE), a cost function that combines utility with epistemic drives, such as ambiguity resolution and novelty seeking. However, the computational burden of EFE minimization had remained a significant obstacle to its scalability. In this paper, we show that EFE-based planning arises naturally from minimizing a variational free energy functional on a generative model augmented with preference and epistemic priors. This result reinforces theoretical consistency with the Free Energy Principle by casting planning under uncertainty itself as a form of variational inference. Our formulation yields policies that jointly support goal achievement and information gain, while incorporating a complexity term that accounts for bounded computational resources. This unifying framework connects and extends existing methods, enabling scalable, resource-aware implementations of active inference agents.
comment: 18 pages
♻ ☆ Constrained free energy minimization for the design of thermal states and stabilizer thermodynamic systems
A quantum thermodynamic system is described by a Hamiltonian and a list of conserved, non-commuting charges, and a fundamental goal is to determine the minimum energy of the system subject to constraints on the charges. Recently, [Liu et al., arXiv:2505.04514] proposed first- and second-order classical and hybrid quantum-classical algorithms for solving a dual chemical potential maximization problem, and they proved that these algorithms converge to global optima by means of gradient-ascent approaches. In this paper, we benchmark these algorithms on several problems of interest in thermodynamics, including one- and two-dimensional quantum Heisenberg models with nearest and next-to-nearest neighbor interactions and with the charges set to the total x, y, and z magnetizations. We also offer an alternative compelling interpretation of these algorithms as methods for designing ground and thermal states of controllable Hamiltonians, with potential applications in molecular and material design. Furthermore, we introduce stabilizer thermodynamic systems as thermodynamic systems based on stabilizer codes, with the Hamiltonian constructed from a given code's stabilizer operators and the charges constructed from the code's logical operators. We benchmark the aforementioned algorithms on several examples of stabilizer thermodynamic systems, including those constructed from the one-to-three-qubit repetition code, the perfect one-to-five-qubit code, and the two-to-four-qubit error-detecting code. Finally, we observe that the aforementioned hybrid quantum-classical algorithms, when applied to stabilizer thermodynamic systems, can serve as alternative methods for encoding qubits into stabilizer codes at a fixed temperature, and we provide an effective method for warm-starting these encoding algorithms whenever a single qubit is encoded into multiple physical qubits.
comment: v2: 35 pages, 12 figures, updated simulations
♻ ☆ Interpretable Clustering: A Survey
In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent. For convenient access and reference, an open repository organizes representative and emerging interpretable clustering methods under the taxonomy proposed in this survey, available at https://github.com/hulianyu/Awesome-Interpretable-Clustering
comment: 14 pages, 2 figures, 3 tables
♻ ☆ AuToMATo: An Out-Of-The-Box Persistence-Based Clustering Algorithm
We present AuToMATo, a novel clustering algorithm based on persistent homology. While AuToMATo is not parameter-free per se, we provide default choices for its parameters that make it into an out-of-the-box clustering algorithm that performs well across the board. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo (with its parameters fixed to their defaults) against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against parameter-free clustering algorithms, but in many instances also significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a clustering algorithm that does not need tuning of its parameters. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.
comment: Code: https://doi.org/10.5281/zenodo.17279740
♻ ☆ Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks
Backdoor attacks have become a significant threat to the pre-training and deployment of deep neural networks (DNNs). Although numerous methods for detecting and mitigating backdoor attacks have been proposed, most rely on identifying and eliminating the ``shortcut" created by the backdoor, which links a specific source class to a target class. However, these approaches can be easily circumvented by designing multiple backdoor triggers that create shortcuts everywhere and therefore nowhere specific. In this study, we explore the concept of Multi-Trigger Backdoor Attacks (MTBAs), where multiple adversaries leverage different types of triggers to poison the same dataset. By proposing and investigating three types of multi-trigger attacks including \textit{parallel}, \textit{sequential}, and \textit{hybrid} attacks, we demonstrate that 1) multiple triggers can coexist, overwrite, or cross-activate one another, and 2) MTBAs easily break the prevalent shortcut assumption underlying most existing backdoor detection/removal methods, rendering them ineffective. Given the security risk posed by MTBAs, we have created a multi-trigger backdoor poisoning dataset to facilitate future research on detecting and mitigating these attacks, and we also discuss potential defense strategies against MTBAs. Our code is available at https://github.com/bboylyg/Multi-Trigger-Backdoor-Attacks.
comment: 13 pages
♻ ☆ Learning to Price Bundles: A GCN Approach for Mixed Bundling
Bundle pricing refers to designing several product combinations (i.e., bundles) and determining their prices in order to maximize the expected profit. It is a classic problem in revenue management and arises in many industries, such as e-commerce, tourism, and video games. However, the problem is typically intractable due to the exponential number of candidate bundles. In this paper, we explore the usage of graph convolutional networks (GCNs) in solving the bundle pricing problem. Specifically, we first develop a graph representation of the mixed bundling model (where every possible bundle is assigned with a specific price) and then train a GCN to learn the latent patterns of optimal bundles. Based on the trained GCN, we propose two inference strategies to derive high-quality feasible solutions. A local-search technique is further proposed to improve the solution quality. Numerical experiments validate the effectiveness and efficiency of our proposed GCN-based framework. Using a GCN trained on instances with 5 products, our methods consistently achieve near-optimal solutions (better than 97%) with only a fraction of computational time for problems of small to medium size. It also achieves superior solutions for larger size of problems compared with other heuristic methods such as bundle size pricing (BSP). The method can also provide high quality solutions for instances with more than 30 products even for the challenging cases where product utilities are non-additive.
♻ ☆ EntryPrune: Neural Network Feature Selection using First Impressions
There is an ongoing effort to develop feature selection algorithms to improve interpretability, reduce computational resources, and minimize overfitting in predictive models. Neural networks stand out as architectures on which to build feature selection methods, and recently, neuron pruning and regrowth have emerged from the sparse neural network literature as promising new tools. We introduce EntryPrune, a novel supervised feature selection algorithm using a dense neural network with a dynamic sparse input layer. It employs entry-based pruning, a novel approach that compares neurons based on their relative change induced when they have entered the network. Extensive experiments on 13 different datasets show that our approach generally outperforms the current state-of-the-art methods, and in particular improves the average accuracy on low-dimensional datasets. Furthermore, we show that EntryPruning surpasses traditional techniques such as magnitude pruning within the EntryPrune framework and that EntryPrune achieves lower runtime than competing approaches. Our code is available at https://github.com/flxzimmer/entryprune.
♻ ☆ SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With SKADA-bench, we propose a framework to evaluate DA methods on diverse modalities, beyond computer vision task that have been largely explored in the literature. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. SKADA-bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without requiring re-evaluating competitors. SKADA-bench is available on Github at https://github.com/scikit-adaptation/skada-bench.
comment: Published in Transactions on Machine Learning Research
♻ ☆ Mutatis Mutandis: Revisiting the Comparator in Discrimination Testing
Testing for individual discrimination involves deriving a profile, the comparator, similar to the one making the discrimination claim, the complainant, based on a protected attribute, such as race or gender, and comparing their decision outcomes. The complainant-comparator pair is central to discrimination testing. Most discrimination testing tools rely on this pair to establish evidence for discrimination. In this work we revisit the role of the comparator in discrimination testing. We first argue for the inherent causal modeling nature of deriving the comparator. We then introduce a two-kinds classification for the comparator: the ceteris paribus, or``with all else equal,'' (CP) comparator and the mutatis mutandis, or ``with the appropriate adjustments being made,'' (MM) comparator. The CP comparator is the standard comparator, representing an idealized comparison for establishing discrimination as it aims for a complainant-comparator pair that only differs on membership to the protected attribute. As an alternative to it, we define the MM comparator, which requires that the comparator represents the``what would have been of'' the complainant without the effects of the protected attribute on the non-protected attributes. Under the MM comparator, the complainant-comparator pair can be dissimilar in terms of the non-protected attributes, departing from an idealized comparison. Notably, the MM comparator is a more complex kind of comparator and its implementation offers an impactful venue for machine learning methods. We illustrate these two comparators and their impact on discrimination testing using a real-world example.
♻ ☆ Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principal open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, for words with a pronunciation-orthography mismatch, these methods may still struggle. We propose a method which allows corrections of substitution errors to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that with this method we get a relative improvement in biased word error rate of up to 8%, while maintaining a competitive overall word error rate.
♻ ☆ Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms
Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS
comment: 54 Pages
♻ ☆ SDFs from Unoriented Point Clouds using Neural Variational Heat Distances
We propose a novel variational approach for computing neural Signed Distance Fields (SDF) from unoriented point clouds. To this end, we replace the commonly used eikonal equation with the heat method, carrying over to the neural domain what has long been standard practice for computing distances on discrete surfaces. This yields two convex optimization problems for whose solution we employ neural networks: We first compute a neural approximation of the gradients of the unsigned distance field through a small time step of heat flow with weighted point cloud densities as initial data. Then we use it to compute a neural approximation of the SDF. We prove that the underlying variational problems are well-posed. Through numerical experiments, we demonstrate that our method provides state-of-the-art surface reconstruction and consistent SDF gradients. Furthermore, we show in a proof-of-concept that it is accurate enough for solving a PDE on the zero-level set.
comment: 15 pages, 17 figures, 4 tables
♻ ☆ Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments
This is the preprint version of the article published in IEEE Access vol. 13, pp. 146503--146526, 2025, doi:10.1109/ACCESS.2025.3599541. Please cite the published version. Urban air pollution remains a pressing global concern, particularly in densely populated and traffic-intensive metropolitan areas like Delhi, where exposure to harmful pollutants severely impacts public health. Delhi, being one of the most polluted cities globally, experiences chronic air quality issues due to vehicular emissions, industrial activities, and construction dust, which exacerbate its already fragile atmospheric conditions. Traditional pollution mitigation strategies, such as static air purifying installations, often fail to maximize their impact due to suboptimal placement and limited adaptability to dynamic urban environments. This study presents a novel deep reinforcement learning (DRL) framework to optimize the placement of air purification booths to improve the air quality index (AQI) in the city of Delhi. We employ Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm, to iteratively learn and identify high-impact locations based on multiple spatial and environmental factors, including population density, traffic patterns, industrial influence, and green space constraints. Our approach is benchmarked against conventional placement strategies, including random and greedy AQI-based methods, using multi-dimensional performance evaluation metrics such as AQI improvement, spatial coverage, population and traffic impact, and spatial entropy.
comment: This is the preprint version of the article published in IEEE Access vol. 13, pp. 146503--146526, 2025, doi:10.1109/ACCESS.2025.3599541. Please cite the published version
♻ ☆ From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning
Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct, particularly after fine-tuning. This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique challenges inherent to both rule-based and model-based verifiers and provide insights toward developing more accurate and robust reward systems for reinforcement learning.
♻ ☆ Conformal Prediction in Hierarchical Classification with Constrained Representation Complexity
Conformal prediction has emerged as a widely used framework for constructing valid prediction sets in classification and regression tasks. In this work, we extend the split conformal prediction framework to hierarchical classification, where prediction sets are commonly restricted to internal nodes of a predefined hierarchy, and propose two computationally efficient inference algorithms. The first algorithm returns internal nodes as prediction sets, while the second one relaxes this restriction. Using the notion of representation complexity, the latter yields smaller set sizes at the cost of a more general and combinatorial inference problem. Empirical evaluations on several benchmark datasets demonstrate the effectiveness of the proposed algorithms in achieving nominal coverage.
♻ ☆ Detecting Invariant Manifolds in ReLU-Based RNNs
Recurrent Neural Networks (RNNs) have found widespread applications in machine learning for time series prediction and dynamical systems reconstruction, and experienced a recent renaissance with improved training algorithms and architectural designs. Understanding why and how trained RNNs produce their behavior is important for scientific and medical applications, and explainable AI more generally. An RNN's dynamical repertoire depends on the topological and geometrical properties of its state space. Stable and unstable manifolds of periodic points play a particularly important role: They dissect a dynamical system's state space into different basins of attraction, and their intersections lead to chaotic dynamics with fractal geometry. Here we introduce a novel algorithm for detecting these manifolds, with a focus on piecewise-linear RNNs (PLRNNs) employing rectified linear units (ReLUs) as their activation function. We demonstrate how the algorithm can be used to trace the boundaries between different basins of attraction, and hence to characterize multistability, a computationally important property. We further show its utility in finding so-called homoclinic points, the intersections between stable and unstable manifolds, and thus establish the existence of chaos in PLRNNs. Finally we show for an empirical example, electrophysiological recordings from a cortical neuron, how insights into the underlying dynamics could be gained through our method.
♻ ☆ DeepBoost-AF: A Novel Unsupervised Feature Learning and Gradient Boosting Fusion for Robust Atrial Fibrillation Detection in Raw ECG Signals
Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with elevated health risks, where timely detection is pivotal for mitigating stroke-related morbidity. This study introduces an innovative hybrid methodology integrating unsupervised deep learning and gradient boosting models to improve AF detection. A 19-layer deep convolutional autoencoder (DCAE) is coupled with three boosting classifiers-AdaBoost, XGBoost, and LightGBM (LGBM)-to harness their complementary advantages while addressing individual limitations. The proposed framework uniquely combines DCAE with gradient boosting, enabling end-to-end AF identification devoid of manual feature extraction. The DCAE-LGBM model attains an F1-score of 95.20%, sensitivity of 99.99%, and inference latency of four seconds, outperforming existing methods and aligning with clinical deployment requirements. The DCAE integration significantly enhances boosting models, positioning this hybrid system as a reliable tool for automated AF detection in clinical settings.
comment: 12-page,4 figures,3 tables, Achieves 95.20% F1-score (99.99% sensitivity) on 8,528 PhysioNet 2017 recordings, Mean inference time: 4 seconds, Python implementation will be open-sourced upon publication
♻ ☆ Object Centric Concept Bottlenecks
Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.
♻ ☆ Conditional Local Independence Testing for Itô processes with Applications to Dynamic Causal Discovery
Inferring causal relationships from dynamical systems is the central interest of many scientific inquiries. Conditional local independence, which describes whether the evolution of one process is influenced by another process given additional processes, is important for causal learning in such systems. In this paper, we propose a hypothesis test for conditional local independence in It\^o processes. Our test is grounded in the semimartingale decomposition of the It\^o process, with which we introduce a stochastic integral process that is a martingale under the null hypothesis. We then apply a test for the martingale property, quantifying potential deviation from local independence. The test statistics is estimated using the optimal filtering equation. We show the consistency of the estimation, thereby establishing the level and power of our test. Numerical verification and a real-world application to causal discovery in brain resting-state fMRIs are conducted.
comment: Preprint
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97\%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ BenchAgents: Multi-Agent Systems for Structured Benchmark Creation
Evaluation insights are limited by the availability of high-quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi-agent framework that methodically leverages large language models (LLMs) to automate evaluation benchmark creation while inherently ensuring data and (evaluation) metric quality. BenchAgents decomposes the benchmark creation process into planning, generation, verification, and evaluation, each of which is ] orchestrated via LLM agents. These agents interact with each other and utilize feedback from benchmark developers to improve and flexibly control data diversity and quality. We use BenchAgents to create benchmarks to evaluate capabilities related to planning, constraint satisfaction, and causal reasoning spanning both language and vision modalities. We then use these benchmarks to study state-of-the-art models and extract new insights into common failure modes and model differences.
♻ ☆ Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines.
♻ ☆ Bypassing Prompt Guards in Production with Controlled-Release Prompting
As large language models (LLMs) advance, ensuring AI safety and alignment is paramount. One popular approach is prompt guards, lightweight mechanisms designed to filter malicious queries while being easy to implement and update. In this work, we introduce a new attack that circumvents such prompt guards, highlighting their limitations. Our method consistently jailbreaks production models while maintaining response quality, even under the highly protected chat interfaces of Google Gemini (2.5 Flash/Pro), DeepSeek Chat (DeepThink), Grok (3), and Mistral Le Chat (Magistral). The attack exploits a resource asymmetry between the prompt guard and the main LLM, encoding a jailbreak prompt that lightweight guards cannot decode but the main model can. This reveals an attack surface inherent to lightweight prompt guards in modern LLM architectures and underscores the need to shift defenses from blocking malicious inputs to preventing malicious outputs. We additionally identify other critical alignment issues, such as copyrighted data extraction, training data extraction, and malicious response leakage during thinking.
♻ ☆ GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field.
comment: 25 pages, 10 tables, 3 figures
♻ ☆ A Middle Path for On-Premises LLM Deployment: Preserving Privacy Without Sacrificing Model Confidentiality
Privacy-sensitive users require deploying large language models (LLMs) within their own infrastructure (on-premises) to safeguard private data and enable customization. However, vulnerabilities in local environments can lead to unauthorized access and potential model theft. To address this, prior research on small models has explored securing only the output layer within hardware-secured devices to balance model confidentiality and customization. Yet this approach fails to protect LLMs effectively. In this paper, we discover that (1) query-based distillation attacks targeting the secured top layer can produce a functionally equivalent replica of the victim model; (2) securing the same number of layers, bottom layers before a transition layer provide stronger protection against distillation attacks than top layers, with comparable effects on customization performance; and (3) the number of secured layers creates a trade-off between protection and customization flexibility. Based on these insights, we propose SOLID, a novel deployment framework that secures a few bottom layers in a secure environment and introduces an efficient metric to optimize the trade-off by determining the ideal number of hidden layers. Extensive experiments on five models (1.3B to 70B parameters) demonstrate that SOLID outperforms baselines, achieving a better balance between protection and downstream customization.
comment: 8 pages for main content of the paper
♻ ☆ Geometry-Preserving Encoder/Decoder in Latent Generative Models
Generative modeling aims to generate new data samples that resemble a given dataset, with diffusion models recently becoming the most popular generative model. One of the main challenges of diffusion models is solving the problem in the input space, which tends to be very high-dimensional. Recently, solving diffusion models in the latent space through an encoder that maps from the data space to a lower-dimensional latent space has been considered to make the training process more efficient and has shown state-of-the-art results. The variational autoencoder (VAE) is the most commonly used encoder/decoder framework in this domain, known for its ability to learn latent representations and generate data samples. In this paper, we introduce a novel encoder/decoder framework with theoretical properties distinct from those of the VAE, specifically designed to preserve the geometric structure of the data distribution. We demonstrate the significant advantages of this geometry-preserving encoder in the training process of both the encoder and decoder. Additionally, we provide theoretical results proving convergence of the training process, including convergence guarantees for encoder training, and results showing faster convergence of decoder training when using the geometry-preserving encoder.
comment: 56 pages
♻ ☆ MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image Synthesis
Deep learning algorithms require extensive data to achieve robust performance. However, data availability is often restricted in the medical domain due to patient privacy concerns. Synthetic data presents a possible solution to these challenges. Recently, image generative models have found increasing use for medical applications but are often designed for singular medical specialties and imaging modalities, thus limiting their broader utility. To address this, we introduce MediSyn: a text-guided, latent diffusion model capable of generating synthetic images from 6 medical specialties and 10 image types. Through extensive experimentation, we first demonstrate that MediSyn quantitatively matches or surpasses the performance of specialist models. Second, we show that our synthetic images are realistic and exhibit strong alignment with their corresponding text prompts, as validated by a team of expert physicians. Third, we provide empirical evidence that our synthetic images are visually distinct from their corresponding real patient images. Finally, we demonstrate that in data-limited settings, classifiers trained solely on synthetic data or real data supplemented with synthetic data can outperform those trained solely on real data. Our findings highlight the immense potential of generalist image generative models to accelerate algorithmic research and development in medicine.
♻ ☆ LLM Unlearning Without an Expert Curated Dataset
Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning-the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets-datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.
♻ ☆ DUA-D2C: Dynamic Uncertainty Aware Method for Overfitting Remediation in Deep Learning
Overfitting remains a significant challenge in deep learning, often arising from data outliers, noise, and limited training data. To address this, the Divide2Conquer (D2C) method was previously proposed, which partitions training data into multiple subsets and trains identical models independently on each. This strategy enables learning more consistent patterns while minimizing the influence of individual outliers and noise. However, D2C's standard aggregation typically treats all subset models equally or based on fixed heuristics (like data size), potentially underutilizing information about their varying generalization capabilities. Building upon this foundation, we introduce Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C), an advanced technique that refines the aggregation process. DUA-D2C dynamically weights the contributions of subset models based on their performance on a shared validation set, considering both accuracy and prediction uncertainty. This intelligent aggregation allows the central model to preferentially learn from subsets yielding more generalizable and confident edge models, thereby more effectively combating overfitting. Empirical evaluations on benchmark datasets spanning multiple domains demonstrate that DUA-D2C significantly improves generalization. Our analysis includes evaluations of decision boundaries, loss curves, and other performance metrics, highlighting the effectiveness of DUA-D2C. This study demonstrates that DUA-D2C improves generalization performance even when applied on top of other regularization methods, establishing it as a theoretically grounded and effective approach to combating overfitting in modern deep learning. Our codes are publicly available at: https://github.com/Saiful185/DUA-D2C.
comment: This version (v2) extends our previous work (arXiv:2411.15876v1) on Divide2Conquer (D2C) by introducing Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C). The manuscript is currently under review at Complex and Intelligent Systems
♻ ☆ Learning Exposure Mapping Functions for Inferring Heterogeneous Peer Effects
In causal inference, interference refers to the phenomenon in which the actions of peers in a network can influence an individual's outcome. Peer effect refers to the difference in counterfactual outcomes of an individual for different levels of peer exposure, the extent to which an individual is exposed to the treatments, actions, or behaviors of peers. Estimating peer effects requires deciding how to represent peer exposure. Typically, researchers define an exposure mapping function that aggregates peer treatments and outputs peer exposure. Most existing approaches for defining exposure mapping functions assume peer exposure based on the number or fraction of treated peers. Recent studies have investigated more complex functions of peer exposure which capture that different peers can exert different degrees of influence. However, none of these works have explicitly considered the problem of automatically learning the exposure mapping function. In this work, we focus on learning this function for the purpose of estimating heterogeneous peer effects, where heterogeneity refers to the variation in counterfactual outcomes for the same peer exposure but different individual's contexts. We develop EgoNetGNN, a graph neural network (GNN)-based method, to automatically learn the appropriate exposure mapping function allowing for complex peer influence mechanisms that, in addition to peer treatments, can involve the local neighborhood structure and edge attributes. We show that GNN models that use peer exposure based on the number or fraction of treated peers or learn peer exposure naively face difficulty accounting for such influence mechanisms. Our comprehensive evaluation on synthetic and semi-synthetic network data shows that our method is more robust to different unknown underlying influence mechanisms when estimating heterogeneous peer effects when compared to state-of-the-art baselines.
♻ ☆ Risk Profiling and Modulation for LLMs
Large language models (LLMs) are increasingly used for decision-making tasks under uncertainty; however, their risk profiles and how they are influenced by prompting and alignment methods remain underexplored. Existing studies have primarily examined personality prompting or multi-agent interactions, leaving open the question of how post-training influences the risk behavior of LLMs. In this work, we propose a new pipeline for eliciting, steering, and modulating LLMs' risk profiles, drawing on tools from behavioral economics and finance. Using utility-theoretic models, we compare pre-trained, instruction-tuned, and RLHF-aligned LLMs, and find that while instruction-tuned models exhibit behaviors consistent with some standard utility formulations, pre-trained and RLHF-aligned models deviate more from any utility models fitted. We further evaluate modulation strategies, including prompt engineering, in-context learning, and post-training, and show that post-training provides the most stable and effective modulation of risk preference. Our findings provide insights into the risk profiles of different classes and stages of LLMs and demonstrate how post-training modulates these profiles, laying the groundwork for future research on behavioral alignment and risk-aware LLM design.
♻ ☆ End-to-End Training of High-Dimensional Optimal Control with Implicit Hamiltonians via Jacobian-Free Backpropagation
Neural network approaches that parameterize value functions have succeeded in approximating high-dimensional optimal feedback controllers when the Hamiltonian admits explicit formulas. However, many practical problems, such as the space shuttle reentry problem and bicycle dynamics, among others, may involve implicit Hamiltonians that do not admit explicit formulas, limiting the applicability of existing methods. Rather than directly parameterizing controls, which does not leverage the Hamiltonian's underlying structure, we propose an end-to-end implicit deep learning approach that directly parameterizes the value function to learn optimal control laws. Our method enforces physical principles by ensuring trained networks adhere to the control laws by exploiting the fundamental relationship between the optimal control and the value function's gradient; this is a direct consequence of the connection between Pontryagin's Maximum Principle and dynamic programming. Using Jacobian-Free Backpropagation (JFB), we achieve efficient training despite temporal coupling in trajectory optimization. We show that JFB produces descent directions for the optimal control objective and experimentally demonstrate that our approach effectively learns high-dimensional feedback controllers across multiple scenarios involving implicit Hamiltonians, which existing methods cannot address.
♻ ☆ Distilled Protein Backbone Generation
Diffusion- and flow-based generative models have recently demonstrated strong performance in protein backbone generation tasks, offering unprecedented capabilities for de novo protein design. However, while achieving notable performance in generation quality, these models are limited by their generating speed, often requiring hundreds of iterative steps in the reverse-diffusion process. This computational bottleneck limits their practical utility in large-scale protein discovery, where thousands to millions of candidate structures are needed. To address this challenge, we explore the techniques of score distillation, which has shown great success in reducing the number of sampling steps in the vision domain while maintaining high generation quality. However, a straightforward adaptation of these methods results in unacceptably low designability. Through extensive study, we have identified how to appropriately adapt Score identity Distillation (SiD), a state-of-the-art score distillation strategy, to train few-step protein backbone generators which significantly reduce sampling time, while maintaining comparable performance to their pretrained teacher model. In particular, multistep generation combined with inference time noise modulation is key to the success. We demonstrate that our distilled few-step generators achieve more than a 20-fold improvement in sampling speed, while achieving similar levels of designability, diversity, and novelty as the Proteina teacher model. This reduction in inference cost enables large-scale in silico protein design, thereby bringing diffusion-based models closer to real-world protein engineering applications.
♻ ☆ Integrating Feature Selection and Machine Learning for Nitrogen Assessment in Grapevine Leaves using In-Field Hyperspectral Imaging
Nitrogen (N) is one of the most crucial nutrients in vineyards, affecting plant growth and subsequent products such as wine and juice. Because soil N has high spatial and temporal variability, it is desirable to accurately estimate the N concentration of grapevine leaves and manage fertilization at the individual plant level to optimally meet plant needs. In this study, we used in-field hyperspectral images with wavelengths ranging from $400 to 1000nm of four different grapevine cultivars collected from distinct vineyards and over two growth stages during two growing seasons to develop models for predicting N concentration at the leaf-level and canopy-level. After image processing, two feature selection methods were employed to identify the optimal set of spectral bands that were responsive to leaf N concentrations. The selected spectral bands were used to train and test two different Machine Learning (ML) models, Gradient Boosting and XGBoost, for predicting nitrogen concentrations. The comparison of selected bands for both leaf-level and canopy-level datasets showed that most of the spectral regions identified by the feature selection methods were across both methods and the dataset types (leaf- and canopy-level datasets), particularly in the key regions, 500-525nm, 650-690nm, 750-800nm, and 900-950nm. These findings indicated the robustness of these spectral regions for predicting nitrogen content. The results for N prediction demonstrated that the ML model achieved an R square of 0.49 for canopy-level data and an R square of 0.57 for leaf-level data, despite using different sets of selected spectral bands for each analysis level. The study demonstrated the potential of using in-field hyperspectral imaging and the use of spectral data in integrated feature selection and ML techniques to monitor N status in vineyards.
comment: Major Revision
♻ ☆ RainSeer: Fine-Grained Rainfall Reconstruction via Physics-Guided Modeling
Reconstructing high-resolution rainfall fields is essential for flood forecasting, hydrological modeling, and climate analysis. However, existing spatial interpolation methods-whether based on automatic weather station (AWS) measurements or enhanced with satellite/radar observations often over-smooth critical structures, failing to capture sharp transitions and localized extremes. We introduce RainSeer, a structure-aware reconstruction framework that reinterprets radar reflectivity as a physically grounded structural prior-capturing when, where, and how rain develops. This shift, however, introduces two fundamental challenges: (i) translating high-resolution volumetric radar fields into sparse point-wise rainfall observations, and (ii) bridging the physical disconnect between aloft hydro-meteors and ground-level precipitation. RainSeer addresses these through a physics-informed two-stage architecture: a Structure-to-Point Mapper performs spatial alignment by projecting mesoscale radar structures into localized ground-level rainfall, through a bidirectional mapping, and a Geo-Aware Rain Decoder captures the semantic transformation of hydro-meteors through descent, melting, and evaporation via a causal spatiotemporal attention mechanism. We evaluate RainSeer on two public datasets-RAIN-F (Korea, 2017-2019) and MeteoNet (France, 2016-2018)-and observe consistent improvements over state-of-the-art baselines, reducing MAE by over 13.31% and significantly enhancing structural fidelity in reconstructed rainfall fields.
♻ ☆ Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking NeurIPS 2025
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57 %. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
comment: Accepted at NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Exchangeability in Neural Network and its Application to Dynamic Pruning
Modern neural networks (NN) contain an ever-growing number of parameters, substantially increasing the memory and computational cost of inference. Researchers have explored various ways to reduce the inference cost of NNs by reducing the model size before deployment and dynamically pruning the inference computation at runtime. In this work, we present ExPrune, a general, dynamic pruning optimization that enables multi-granularity partial computation on a per-input basis. ExPrune requires no change to the model architecture or the training algorithm. ExPrune is based on our theoretical results that the relationship between certain model parameters and intermediate values can be described by a statistical property called exchangeability. By identifying exchangeable parameters and values in the model, we are able to first partially evaluate the network, analyze the statistics of the partial results, and make pruning decisions on the fly. Because ExPrune is theory grounded, it generalizes across model architectures in different problem domains. We evaluate ExPrune on one computer vision models, one graph model and one language model. ExPrune provides 10.98--17.33% reduction in FLOPs with negligible accuracy drop and 21.61--27.16% reduction in FLOPs with at most 1% accuracy drop. We also demonstrate that ExPrune composes with static magnitude pruning. On models that have been aggressively statically pruned, ExPrune still provides additional 10.24--11.11% reduction in FLOPs with negligible accuracy drop and 13.91--14.39% reduction in FLOPs with at most 1% accuracy drop.
♻ ☆ On residual network depth
Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal, explaining the historical necessity of normalization layers in training deep models. This insight offers a first principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network's inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity controls that also implicitly regularizes the model's complexity.
♻ ☆ Longitudinal Flow Matching for Trajectory Modeling
Generative models for sequential data often struggle with sparsely sampled and high-dimensional trajectories, typically reducing the learning of dynamics to pairwise transitions. We propose Interpolative Multi-Marginal Flow Matching (IMMFM), a framework that learns continuous stochastic dynamics jointly consistent with multiple observed time points. IMMFM employs a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning. This design captures intrinsic stochasticity, handles irregular sparse sampling, and yields subject-specific trajectories. Experiments on synthetic benchmarks and real-world longitudinal neuroimaging datasets show that IMMFM outperforms existing methods in both forecasting accuracy and further downstream tasks.
♻ ☆ Testing Support Size More Efficiently Than Learning Histograms
Consider two problems about an unknown probability distribution $p$: 1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$\epsilon$-far" (in total variation distance) from being supported on $n$ elements. 2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $\Theta(\tfrac{n}{\epsilon^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{\epsilon \log n} \log(1/\epsilon))$ samples, nearly matching the best known lower bound of $\Omega(\tfrac{n}{\epsilon \log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.
comment: 40 pages. Appeared at STOC 2025
♻ ☆ MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding NeurIPS 2025
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.
comment: Accepted to NeurIPS 2025
♻ ☆ Generalizable Physics-Informed Learning for Stochastic Safety-Critical Systems
Accurate estimation of long-term risk is essential for the design and analysis of stochastic dynamical systems. Existing risk quantification methods typically rely on extensive datasets involving risk events observed over extended time horizons, which can be prohibitively expensive to acquire. Motivated by this gap, we propose an efficient method for learning long-term risk probabilities using short-term samples with limited occurrence of risk events. Specifically, we establish that four distinct classes of long-term risk probabilities are characterized by specific partial differential equations (PDEs). Using this characterization, we introduce a physics-informed learning framework that combines empirical data with physics information to infer risk probabilities. We then analyze the theoretical properties of this framework in terms of generalization and convergence. Through numerical experiments, we demonstrate that our framework not only generalizes effectively beyond the sampled states and time horizons but also offers additional benefits such as improved sample efficiency, rapid online inference capabilities under changing system dynamics, and stable computation of probability gradients. These results highlight how embedding PDE constraints, which contain explicit gradient terms and inform how risk probabilities depend on state, time horizon, and system parameters, improves interpolation and generalization between/beyond the available data.
♻ ☆ SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications NeurIPS 2025
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emph{SuffixDecoding}, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3$\times$, outperforming state-of-the-art methods -- 2.8$\times$ faster than model-based approaches like EAGLE-2/3 and 1.9$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced at https://github.com/snowflakedb/ArcticInference
comment: NeurIPS 2025 (Spotlight)
♻ ☆ Attacking the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples
Spiking neural networks (SNNs) have drawn much attention for their high energy efficiency and recent advances in classification performance. However, unlike traditional deep learning, the robustness of SNNs to adversarial examples remains underexplored. This work advances the adversarial attack side of SNNs and makes three major contributions. First, we show that successful white-box attacks on SNNs strongly depend on the surrogate gradient estimation technique, even for adversarially trained models. Second, using the best single surrogate gradient estimator, we study the transferability of adversarial examples between SNNs and state-of-the-art architectures such as Vision Transformers (ViTs) and CNNs. Our analysis reveals two major gaps: no existing white-box attack leverages multiple surrogate estimators, and no single attack effectively fools both SNNs and non-SNN models simultaneously. Third, we propose the Mixed Dynamic Spiking Estimation (MDSE) attack, which dynamically combines multiple surrogate gradients to overcome these gaps. MDSE produces adversarial examples that fool both SNN and non-SNN models, achieving up to 91.4% higher effectiveness on SNN/ViT ensembles and a 3x boost on adversarially trained SNN ensembles over Auto-PGD. Experiments span three datasets (CIFAR-10, CIFAR-100, ImageNet) and nineteen classifiers, and we will release code and models upon publication.
comment: Accepted manuscript. Published in *Neurocomputing*, Volume 656, 2025, Article 131506. Available online 12 September 2025. DOI: 10.1016/j.neucom.2025.131506
♻ ☆ Adversarial Surrogate Risk Bounds for Binary Classification
A central concern in classification is the vulnerability of machine learning models to adversarial attacks. Adversarial training is one of the most popular techniques for training robust classifiers, which involves minimizing an adversarial surrogate risk. Recent work has characterized the conditions under which any sequence minimizing the adversarial surrogate risk also minimizes the adversarial classification risk in the binary setting, a property known as adversarial consistency. However, these results do not address the rate at which the adversarial classification risk approaches its optimal value along such a sequence. This paper provides surrogate risk bounds that quantify that convergence rate.
comment: 37 pages, 3 figures
♻ ☆ The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs
Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we identify the most consistently successful modular method to be fine-tuning separate language and math experts and model merging via Layer-Swapping, somewhat surprisingly. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.
comment: MRL Workshop at EMNLP 2025
♻ ☆ Weight Ensembling Improves Reasoning in Language Models
We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
♻ ☆ LLM Unlearning via Neural Activation Redirection
The ability to selectively remove knowledge from LLMs is highly desirable. However, existing methods often struggle with balancing unlearning efficacy and retain model utility, and lack controllability at inference time to emulate base model behavior as if it had never seen the unlearned data. In this paper, we propose LUNAR, a novel unlearning method grounded in the Linear Representation Hypothesis and operates by redirecting the representations of unlearned data to activation regions that expresses its inability to answer. We show that contrastive features are not a prerequisite for effective activation redirection, and LUNAR achieves state-of-the-art unlearning performance and superior controllability. Specifically, LUNAR achieves between 2.9x and 11.7x improvement in the combined unlearning efficacy and model utility score (Deviation Score) across various base models and generates coherent, contextually appropriate responses post-unlearning. Moreover, LUNAR effectively reduces parameter updates to a single down-projection matrix, a novel design that significantly enhances efficiency by 20x and robustness. Finally, we demonstrate that LUNAR is robust to white-box adversarial attacks and versatile in real-world scenarios, including handling sequential unlearning requests.
♻ ☆ A Novel Collaborative Framework for Efficient Synchronization in Split Federated Learning over Wireless Networks
Split Federated Learning (SFL) offers a promising approach for distributed model training in wireless networks, combining the layer-partitioning advantages of split learning with the federated aggregation that ensures global convergence. However, in heterogeneous wireless environments, disparities in device capabilities and channel conditions make strict round-based synchronization heavily straggler-dominated, thereby limiting both efficiency and scalability. To address this challenge, we propose a new framework, called Collaborative Split Federated Learning (CSFL), that redefines workload redistribution through device-to-device collaboration. Building on the flexibility of model partitioning, CSFL enables efficient devices, after completing their own forward propagation, to seamlessly take over the unfinished layers of bottleneck devices. This collaborative process, supported by D2D communications, allows bottleneck devices to offload computation earlier while maintaining synchronized progression across the network. Beyond the system design, we highlight key technical enablers such as privacy protection, multi-perspective matching, and incentive mechanisms, and discuss practical challenges including matching balance, privacy risks, and incentive sustainability. A case study demonstrates that CSFL significantly reduces training latency without compromising convergence speed or accuracy, underscoring collaboration as a key enabler for synchronization-efficient learning in next-generation wireless networks.
♻ ☆ Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization
Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as the limited number of compound-protein interaction activity data for specific protein targets, and plenty of subtle changes in molecular configuration sites significantly affecting molecular properties. We exploit pairs of molecules with activity cliffs that share scaffolds but differ at substituent sites, characterized by large potency differences for specific protein targets. We propose a framework by implementing graph neural networks (GNNs) to leverage property and structure information from activity cliff pairs to predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train GNNs with structure-aware loss functions using group lasso and sparse group lasso regularizations, which prune and highlight molecular subgraphs relevant to activity differences. We applied this framework to activity cliff data of molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Our approach improved property prediction by integrating common and uncommon node information with sparse group lasso, as reflected in reduced root mean squared error (RMSE) and improved Pearson's correlation coefficient (PCC). Applying regularizations also enhances feature attribution for GNN by boosting graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.
comment: 15 pages, 7 figures
♻ ☆ Multi-Agent Stage-wise Conservative Linear Bandits
In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network's second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.
♻ ☆ Baseline Systems For The 2025 Low-Resource Audio Codec Challenge
The Low-Resource Audio Codec (LRAC) Challenge aims to advance neural audio coding for deployment in resource-constrained environments. The first edition focuses on low-resource neural speech codecs that must operate reliably under everyday noise and reverberation, while satisfying strict constraints on computational complexity, latency, and bitrate. Track 1 targets transparency codecs, which aim to preserve the perceptual transparency of input speech under mild noise and reverberation. Track 2 addresses enhancement codecs, which combine coding and compression with denoising and dereverberation. This paper presents the official baseline systems for both tracks in the 2025 LRAC Challenge. The baselines are convolutional neural codec models with Residual Vector Quantization, trained end-to-end using a combination of adversarial and reconstruction objectives. We detail the data filtering and augmentation strategies, model architectures, optimization procedures, and checkpoint selection criteria.
comment: Low-Resource Audio Codec Challenge 2025
♻ ☆ Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. The predicted global scaling factor and layer-specific weights then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments using safety benchmarks like ToxicChat & In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B & Mistral-7B show our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.
♻ ☆ Performance of machine-learning-assisted Monte Carlo in sampling from simple statistical physics models
Recent years have seen a rise in the application of machine learning techniques to aid the simulation of hard-to-sample systems that cannot be studied using traditional methods. Despite the introduction of many different architectures and procedures, a wide theoretical understanding is still lacking, with the risk of suboptimal implementations. As a first step to address this gap, we provide here a complete analytic study of the widely-used Sequential Tempering procedure applied to a shallow MADE architecture for the Curie-Weiss model. The contribution of this work is twofold: firstly, we give a description of the optimal weights and of the training under Gradient Descent optimization. Secondly, we compare what happens in Sequential Tempering with and without the addition of local Metropolis Monte Carlo steps. We are thus able to give theoretical predictions on the best procedure to apply in this case. This work establishes a clear theoretical basis for the integration of machine learning techniques into Monte Carlo sampling and optimization.
comment: 17 pages, 10 figures
♻ ☆ A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks
Pairwise distance-based costs are crucial for self-supervised and contrastive feature learning. Mixture Density Networks (MDNs) are a widely used approach for generative models and density approximation, using neural networks to produce multiple centers that define a Gaussian mixture. By combining MDNs with contrastive costs, this paper proposes data density approximation using four types of kernelized matrix costs in the Hilbert space: the scalar cost, the vector-matrix cost, the matrix-matrix cost (the trace of Schur complement), and the SVD cost (the nuclear norm), for learning multiple centers required to define a mixture density.
♻ ☆ Bayesian Distributional Models of Executive Functioning
This study uses controlled simulations with known ground-truth parameters to evaluate how Distributional Latent Variable Models (DLVM) and Bayesian Distributional Active LEarning (DALE) perform in comparison to conventional Independent Maximum Likelihood Estimation (IMLE). DLVM integrates observations across multiple executive function tasks and individuals, allowing parameter estimation even under sparse or incomplete data conditions. DLVM consistently outperformed IMLE, especially under with smaller amounts of data, and converges faster to highly accurate estimates of the true distributions. In a second set of analyses, DALE adaptively guided sampling to maximize information gain, outperforming random sampling and fixed test batteries, particularly within the first 80 trials. These findings establish the advantages of combining DLVM's cross-task inference with DALE's optimal adaptive sampling, providing a principled basis for more efficient cognitive assessments.
comment: 42 pages, 8 figures, 1 table
♻ ☆ Denoising Score Matching with Random Features: Insights on Diffusion Models from Precise Learning Curves
We theoretically investigate the phenomena of generalization and memorization in diffusion models. Empirical studies suggest that these phenomena are influenced by model complexity and the size of the training dataset. In our experiments, we further observe that the number of noise samples per data sample ($m$) used during Denoising Score Matching (DSM) plays a significant and non-trivial role. We capture these behaviors and shed insights into their mechanisms by deriving asymptotically precise expressions for test and train errors of DSM under a simple theoretical setting. The score function is parameterized by random features neural networks, with the target distribution being $d$-dimensional Gaussian. We operate in a regime where the dimension $d$, number of data samples $n$, and number of features $p$ tend to infinity while keeping the ratios $\psi_n=\frac{n}{d}$ and $\psi_p=\frac{p}{d}$ fixed. By characterizing the test and train errors, we identify regimes of generalization and memorization as a function of $\psi_n,\psi_p$, and $m$. Our theoretical findings are consistent with the empirical observations.
comment: 8 pages
♻ ☆ Auto-Compressing Networks NeurIPS 2025
Deep neural networks with short residual connections have demonstrated remarkable success across domains, but increasing depth often introduces computational redundancy without corresponding improvements in representation quality. We introduce Auto-Compressing Networks (ACNs), an architectural variant where additive long feedforward connections from each layer to the output replace traditional short residual connections. By analyzing the distinct dynamics induced by this modification, we reveal a unique property we coin as auto-compression, the ability of a network to organically compress information during training with gradient descent, through architectural design alone. Through auto-compression, information is dynamically "pushed" into early layers during training, enhancing their representational quality and revealing potential redundancy in deeper ones. We theoretically show that this property emerges from layer-wise training patterns present in ACNs, where layers are dynamically utilized during training based on task requirements. We also find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, improved transfer learning capabilities, and mitigate catastrophic forgetting suggesting that they learn representations that generalize better despite using fewer parameters. Our results demonstrate up to 18% reduction in catastrophic forgetting and 30-80% architectural compression while maintaining accuracy across vision transformers, MLP-mixers, and BERT architectures. These findings establish ACNs as a practical approach to developing efficient neural architectures that automatically adapt their computational footprint to task complexity, while learning robust representations suitable for noisy real-world tasks and continual learning scenarios.
comment: NeurIPS 2025, 28 pages
♻ ☆ On the necessity of adaptive regularisation:Optimal anytime online learning on $\boldsymbol{\ell_p}$-balls
We study online convex optimization on $\ell_p$-balls in $\mathbb{R}^d$ for $p > 2$. While always sub-linear, the optimal regret exhibits a shift between the high-dimensional setting ($d > T$), when the dimension $d$ is greater than the time horizon $T$ and the low-dimensional setting ($d \leq T$). We show that Follow-the-Regularised-Leader (FTRL) with time-varying regularisation which is adaptive to the dimension regime is anytime optimal for all dimension regimes. Motivated by this, we ask whether it is possible to obtain anytime optimality of FTRL with fixed non-adaptive regularisation. Our main result establishes that for separable regularisers, adaptivity in the regulariser is necessary, and that any fixed regulariser will be sub-optimal in one of the two dimension regimes. Finally, we provide lower bounds which rule out sub-linear regret bounds for the linear bandit problem in sufficiently high-dimension for all $\ell_p$-balls with $p \geq 1$.
♻ ☆ Machine Learning H-theorem
H-theorem provides a microscopic foundation of the Second Law of Thermodynamics and is therefore essential to establishing statistical physics, but at the same time, H-theorem has been subject to controversy that in part persists till this day. To better understand H-theorem and its relation to the arrow of time, we study the equilibration of randomly oriented and positioned hard disks with periodic boundary conditions. Using a model based on the DeepSets architecture, which imposes permutation invariance of the particle labels, we train a model to capture the irreversibility of the H-functional.
♻ ☆ Probing forced responses and causal mechanisms in large-scale climate dynamics with reduced-order neural models
A central challenge in climate science and applied mathematics is developing data-driven models of multiscale systems that capture not only stationary statistics but also responses to external perturbations. Current neural climate emulators seek to resolve the atmosphere-ocean system in all its complexity but often fail to reproduce forced responses, limiting their use in causal studies such as Green's function experiments. Response theory offers a rigorous framework to explore these limitations, moving beyond stationary statistics, probing causal mechanisms, and guiding model design. Using a simplified multiscale system, we argue that the skill of purely data-driven models in reproducing perturbed statistics depends critically on (i) identifying the proper variables to model, including spatial and temporal scales, and (ii) accurate stochastic parameterizations of unresolved processes. For studies of low-frequency climate dynamics, these insights motivate reduced-order neural models, tailored to specific processes, as valuable alternatives to general-purpose emulators. We illustrate this perspective with a reduced-order neural stochastic model designed to investigate responses of radiative fluxes to surface temperature perturbations. The model largely reproduces stationary statistics and enables the study of forced responses in both ensemble mean and variance, shifting the study of climate feedbacks from single trajectories to responses of probability distributions. These results underscore reduced-order models, augmented with modern neural networks and evaluated within the framework of response theory, as a practical strategy for causal inference and attribution studies in large-scale climate dynamics.
♻ ☆ Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment
Multimodal contrastive learning (MCL) aims to embed data from different modalities in a shared embedding space. However, empirical evidence shows that representations from different modalities occupy completely separate regions of embedding space, a phenomenon referred to as the modality gap. Moreover, experimental findings on how the size of the modality gap influences downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) How does it affect downstream tasks? To address these questions, this paper introduces the first theoretical framework for analyzing the convergent optimal representations of MCL and the modality alignment when training is optimized. Specifically, we prove that without any constraint or under the cone constraint, the modality gap converges to zero. Under the subspace constraint (i.e., representations of two modalities fall into two distinct hyperplanes due to dimension collapse), the modality gap converges to the smallest angle between the two hyperplanes. This result identifies \emph{dimension collapse} as the fundamental origin of the modality gap. Furthermore, our theorems demonstrate that paired samples cannot be perfectly aligned under the subspace constraint. The modality gap influences downstream performance by affecting the alignment between sample pairs. We prove that, in this case, perfect alignment between two modalities can still be achieved via two ways: hyperplane rotation and shared space projection.
♻ ☆ MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices
The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.
comment: EMNLP 2025 (Findings), Project Page: https://kaist-dmlab.github.io/MONAQ
♻ ☆ Machine Learning and Multi-source Remote Sensing in Forest Aboveground Biomass Estimation: A Review
Quantifying forest aboveground biomass (AGB) is crucial for informing decisions and policies that will protect the planet. Machine learning (ML) and remote sensing (RS) techniques have been used to do this task more effectively, yet there lacks a systematic review on the most recent working combinations of ML methods and multiple RS sources, especially with the consideration of the forests' ecological characteristics. This study systematically analyzed 25 papers that met strict inclusion criteria from over 80 related studies, identifying all ML methods and combinations of RS data used. Random Forest had the most frequent appearance (88\% of studies), while Extreme Gradient Boosting showed superior performance in 75\% of the studies in which it was compared with other methods. Sentinel-1 emerged as the most utilized remote sensing source, with multi-sensor approaches (e.g., Sentinel-1, Sentinel-2, and LiDAR) proving especially effective. Our findings provide grounds for recommending which sensing sources, variables, and methods to consider using when integrating ML and RS for forest AGB estimation.
comment: First and corresponding author: Autumn Nguyen
♻ ☆ Efficient Flow Matching using Latent Variables
Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2d Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.